gong24b@interspeech_2024@ISCA

Total: 1

#1 Contextual Biasing Speech Recognition in Speech-enhanced Large Language Model

Authors: Xun Gong; Anqi Lv; Zhiming Wang; Yanmin Qian

Recently, rapid advances in audio- and speech-enhanced large language models (SpeechLLMs), such as Qwen-Audio and SALMONN, have significantly propelled automatic speech recognition (ASR) forward. However, despite improvements in general recognition capability, bias word recognition remains a prominent challenge for SpeechLLMs and has not been extensively studied. In this study, we introduce two contextual biasing strategies aimed at improving the bias word recognition of SpeechLLMs. First, we explore two types of biasing prompts for SpeechLLMs, achieving a 10% relative reduction in bias word error rate (WER); however, as the size of the bias list grows, performance degrades significantly due to hallucination. We then build a biasing fusion network that integrates high-level bias embeddings into the SpeechLLM framework. Experiments on the LibriSpeech test-clean/-other datasets show that our method achieves up to 10%/35% relative reduction in overall/bias WER compared to our baseline.
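
The abstract describes two strategies without architectural detail, so the following is only a minimal sketch of what they could look like, not the authors' implementation. It assumes a PyTorch-style SpeechLLM whose text prompt can carry the bias list (idea 1) and whose hidden states can be fused with per-phrase bias embeddings via cross-attention (idea 2). All names (build_biasing_prompt, BiasingFusion), the prompt template, and the dimensions are hypothetical.

```python
# Hedged sketch of the two contextual biasing ideas from the abstract.
# Not the paper's code; all module and function names are assumptions.

import torch
import torch.nn as nn


def build_biasing_prompt(bias_words, instruction="Transcribe the audio."):
    """Idea 1: prepend the bias list to the text prompt given to the SpeechLLM.

    The paper's exact prompt template is not specified in the abstract;
    this is one plausible format for illustration only.
    """
    hint = "Pay attention to these words: " + ", ".join(bias_words)
    return f"{instruction} {hint}"


class BiasingFusion(nn.Module):
    """Idea 2 (assumed form): fuse high-level bias embeddings into the
    speech-conditioned LLM states via cross-attention plus a residual."""

    def __init__(self, d_model=1024, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden_states, bias_embeddings):
        # hidden_states:   (batch, T, d_model) speech/LLM hidden states
        # bias_embeddings: (batch, N, d_model) one embedding per bias phrase
        attended, _ = self.cross_attn(
            query=hidden_states, key=bias_embeddings, value=bias_embeddings
        )
        return self.norm(hidden_states + attended)


if __name__ == "__main__":
    # Toy usage with dummy tensors standing in for real SpeechLLM states.
    prompt = build_biasing_prompt(["SALMONN", "Qwen-Audio"])
    fusion = BiasingFusion(d_model=1024)
    h = torch.randn(2, 50, 1024)   # dummy hidden states
    b = torch.randn(2, 10, 1024)   # dummy bias-phrase embeddings
    out = fusion(h, b)
    print(prompt)
    print(out.shape)               # torch.Size([2, 50, 1024])
```

One design point the abstract hints at: the prompt-only approach degrades as the bias list grows (hallucination), which is why a separate fusion path over bias embeddings, as sketched above, can scale to larger lists.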