Dual-Mode NAM: Effective Top-K Context Injection for End-to-End ASR

Authors: Zelin Wu, Tsendsuren Munkhdalai, Pat Rondon, Golan Pundak, Khe Chai Sim, Christopher Li

ASR systems in real applications must be adapted on the fly to correctly recognize task-specific contextual terms, such as contacts, application names, and media entities. However, it is challenging to achieve scalability, large in-domain quality gains, and minimal out-of-domain quality regressions simultaneously. In this work, we introduce an effective neural biasing architecture called Dual-Mode NAM. Dual-Mode NAM embeds a trainable top-k search process in its attention mechanism, so that it accurately selects the top-k phrases before injecting the corresponding word-piece context into the acoustic encoder. We further propose a controllable mechanism that lets the ASR system trade off in-domain against out-of-domain quality at inference time. When evaluated on a large-scale biasing benchmark, the combined techniques improve on a previously proposed method, reducing average in-domain and out-of-domain WER by up to 53.3% and 12.0% relative, respectively.
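The two-stage idea in the abstract (select the top-k phrases first, then attend over only their word pieces) can be illustrated with a minimal sketch. This is not the authors' architecture: the dot-product scoring, the residual addition, and every name below (`top_k_phrases`, `inject_context`, the toy embeddings) are assumptions made for illustration, and nothing here is trained.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_phrases(query, phrase_embs, k):
    """Indices of the k phrases scoring highest against the query (hypothetical scoring)."""
    scores = [dot(query, p) for p in phrase_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def inject_context(query, phrase_embs, wordpiece_embs, k):
    """Attend over the word pieces of the top-k phrases only, then add the result to the query."""
    selected = top_k_phrases(query, phrase_embs, k)
    # Flatten the word-piece embeddings of the selected phrases.
    pieces = [wp for i in selected for wp in wordpiece_embs[i]]
    weights = softmax([dot(query, wp) for wp in pieces])
    context = [
        sum(w * wp[d] for w, wp in zip(weights, pieces))
        for d in range(len(query))
    ]
    # Residual injection (an assumption): acoustic feature plus attended context.
    return selected, [q + c for q, c in zip(query, context)]


# Toy data: one acoustic query, three bias phrases, each with its word pieces.
query = [1.0, 0.0]
phrase_embs = [[0.9, 0.1], [-1.0, 0.0], [0.5, 0.5]]
wordpiece_embs = [
    [[1.0, 0.0], [0.8, 0.2]],
    [[-1.0, 0.0]],
    [[0.5, 0.5]],
]
selected, biased = inject_context(query, phrase_embs, wordpiece_embs, k=2)
```

In this toy setup the second phrase scores lowest against the query, so its word pieces never enter the attention step. Restricting attention this way is what keeps the second stage cheap as the phrase list grows.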

Subject: INTERSPEECH.2023 - Speech Recognition

wu23e@interspeech_2023@ISCA
