A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Authors: Yangze Li, Xiong Wang, Songjun Cao, Yike Zhang, Long Ma, Lei Xie

An audio-LLM introduces the audio modality into a large language model (LLM), enabling a powerful LLM to recognize, understand, and generate audio. However, during speech recognition in noisy environments, we observed hallucination and repetition issues in the audio-LLM, which lead to substitution and insertion errors. This paper proposes a transcription prompt-based audio-LLM that addresses these problems by introducing an ASR expert as a transcription tokenizer and a hybrid autoregressive (AR) and non-autoregressive (NAR) decoding approach. Experiments on the 10k-hour WenetSpeech Mandarin corpus show that our approach reduces the character error rate (CER) by 12.2% and 9.6% relative on the Test_Net and Test_Meeting evaluation sets, respectively, compared with the baseline. Notably, we reduce the decoding repetition rate on the evaluation set to zero, showing that the decoding repetition problem has been fundamentally solved.
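
The abstract reports results as relative CER reductions. As a reading aid, here is a minimal sketch of how character error rate and a relative reduction are commonly computed. The function names and the toy inputs are illustrative assumptions, not taken from the paper, and the paper's own scoring pipeline (text normalization, tooling) may differ.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to match an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference char missing in hypothesis)
                curr[j - 1] + 1,     # insertion (extra char in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline


# Toy values only (hypothetical, not results from the paper).
toy_cer = cer("abcd", "abxd")                    # one substitution in four characters
toy_reduction = relative_reduction(0.10, 0.08)   # baseline 10% CER, improved 8% CER
```

Under this definition, a relative reduction compares the two error rates as a fraction of the baseline rate, so it is not the same as the absolute difference in percentage points.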

Subjects: Sound, Audio and Speech Processing

Published: 2024-08-18 14:10:35 UTC

2408.09491
