wu23e@interspeech_2023@ISCA

Dual-Mode NAM: Effective Top-K Context Injection for End-to-End ASR

Authors: Zelin Wu ; Tsendsuren Munkhdalai ; Pat Rondon ; Golan Pundak ; Khe Chai Sim ; Christopher Li

ASR systems in real applications must be adapted on the fly to correctly recognize task-specific contextual terms, such as contacts, application names, and media entities. However, it is challenging to achieve scalability, large in-domain quality gains, and minimal out-of-domain quality regressions simultaneously. In this work, we introduce an effective neural biasing architecture called Dual-Mode NAM. Dual-Mode NAM embeds a trainable top-k search process in its attention mechanism to perform accurate top-k phrase selection before injecting the corresponding word-piece context into the acoustic encoder. We further propose a controllable mechanism that lets the ASR system trade off in-domain against out-of-domain quality at inference time. When evaluated on a large-scale biasing benchmark, the combined techniques improve on a previously proposed method, reducing average in-domain and out-of-domain WER by up to 53.3% and 12.0% relative, respectively.
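
The two-stage idea in the abstract (select the top-k phrases first, then use only their word-piece context) can be illustrated with a small, non-trainable sketch in plain Python. This is not the authors' architecture: the function name `top_k_context`, the scaled dot-product scoring, the hard (non-differentiable) selection, and the toy embeddings are all assumptions made for illustration, and the sketch does not model training or how the context is injected into the acoustic encoder.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_context(query, phrase_keys, wordpiece_values, k):
    """Hypothetical helper: pick k phrases, then attend over their word-pieces.

    query: one acoustic feature vector (list of floats).
    phrase_keys: one embedding per bias phrase.
    wordpiece_values: for each phrase, a list of word-piece embeddings.
    Returns (sorted indices of the selected phrases, context vector).
    """
    scale = math.sqrt(len(query))
    # Stage 1: score every phrase and keep the k highest (assumed scoring rule).
    phrase_scores = [dot(query, key) / scale for key in phrase_keys]
    ranked = sorted(range(len(phrase_keys)),
                    key=lambda i: phrase_scores[i], reverse=True)
    selected = sorted(ranked[:k])
    # Stage 2: attend only over word-pieces of the selected phrases.
    pieces = [wp for i in selected for wp in wordpiece_values[i]]
    weights = softmax([dot(query, wp) / scale for wp in pieces])
    context = [sum(w * wp[d] for w, wp in zip(weights, pieces))
               for d in range(len(query))]
    return selected, context


# Toy usage: three bias phrases with made-up 2-dimensional embeddings.
query = [1.0, 0.0]
phrase_keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
wordpiece_values = [
    [[1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.8, 0.2], [1.0, 0.0]],
]
selected, context = top_k_context(query, phrase_keys, wordpiece_values, k=2)
print(selected)
```

Restricting the second stage to the selected phrases keeps its cost tied to k rather than to the full phrase list, which is one plausible reading of the scalability goal stated above.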