meng24@interspeech_2024@ISCA


#1 SEQ-former: A context-enhanced and efficient automatic speech recognition framework

Authors: Qinglin Meng; Min Liu; Kaixun Huang; Kun Wei; Lei Xie; Zongfeng Quan; Weihong Deng; Quan Lu; Ning Jiang; Guoqing Zhao

Contextual information is crucial for automatic speech recognition (ASR), and using it effectively can improve recognition accuracy. To improve the model's ability to capture this information, we propose a novel ASR framework called SEQ-former, emphasizing simplicity, efficiency, and quickness. We incorporate a Prediction Decoder Network and a Shared Prediction Decoder Network to enhance contextual capabilities. To further increase efficiency, we use intermediate CTC and CTC Spike Reduce Methods to guide attention masks and reduce redundant peaks. Our approach achieves state-of-the-art performance on the AISHELL-1 dataset, improves decoding efficiency, and delivers competitive results on LibriSpeech. Additionally, it achieves a 6.3% relative improvement over Efficient Conformer on 11,000 hours of private data.
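The abstract's idea of reducing redundant CTC spikes and using the surviving frames to guide an attention mask can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ctc_spike_mask`, the greedy per-frame labeling, the confidence `threshold`, and the "keep the first frame of each non-blank run" rule are all illustrative assumptions about how such a spike-reduction step might work.

```python
# Hedged sketch (not the SEQ-former code): collapse redundant CTC spikes
# and emit a boolean frame mask that downstream attention could use.
import numpy as np

def ctc_spike_mask(posteriors: np.ndarray, blank_id: int = 0,
                   threshold: float = 0.5) -> np.ndarray:
    """posteriors: (T, V) per-frame CTC label probabilities.
    Returns a boolean (T,) mask keeping one frame per non-blank spike run."""
    labels = posteriors.argmax(axis=1)            # greedy per-frame labels
    confident = posteriors.max(axis=1) >= threshold
    spikes = (labels != blank_id) & confident     # confident non-blank frames
    mask = np.zeros(len(labels), dtype=bool)
    prev = blank_id
    for t in range(len(labels)):
        if spikes[t] and labels[t] != prev:       # first frame of a new run
            mask[t] = True
        prev = labels[t] if spikes[t] else blank_id
    return mask

# Example: frames [blank, 'a', 'a', blank, 'b', blank] keep only the
# first 'a' frame and the 'b' frame.
P = np.array([[0.9, 0.05, 0.05],
              [0.05, 0.9, 0.05],
              [0.05, 0.9, 0.05],   # redundant repeat of 'a'
              [0.9, 0.05, 0.05],
              [0.05, 0.05, 0.9],
              [0.9, 0.05, 0.05]])
print(ctc_spike_mask(P).tolist())  # → [False, True, False, False, True, False]
```

Masking non-spike frames in this way shrinks the set of positions attention must cover, which is one plausible reading of how reducing redundant peaks improves decoding efficiency.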