zou24@interspeech_2024@ISCA

Total: 1

#1 E-Paraformer: A Faster and Better Parallel Transformer for Non-autoregressive End-to-End Mandarin Speech Recognition

Authors: Kun Zou; Fengyun Tan; Ziyang Zhuang; Chenfeng Miao; Tao Wei; Shaodan Zhai; Zijian Li; Wei Hu; Shaojun Wang; Jing Xiao

Paraformer is a powerful non-autoregressive (NAR) model for Mandarin speech recognition. It relies on the Continuous Integrate-and-Fire (CIF) mechanism to implement parallel decoding. However, CIF must recursively locate the acoustic boundary of each emitted token, which is inefficient. In this paper, we introduce a novel monotonic alignment mechanism as an alternative to CIF that converts frame-level embeddings into token-level embeddings in parallel. Combining this method with other improvements to the model structure, we design a faster and better parallel transformer called the Efficient Paraformer (E-Paraformer). Experiments are performed on the AISHELL-1 benchmark. Compared to the Paraformer baseline, the E-Paraformer achieves character error rates (CER) of 4.36%/4.79% on the AISHELL-1 dev/test sets, relative reductions of 7.8% and 6.3%, respectively. Moreover, it achieves about 2x faster inference and 1.35x faster training.
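To make the bottleneck the abstract refers to concrete, below is a minimal NumPy sketch of the standard CIF firing loop (not code from this paper): per-frame weights alpha are integrated frame by frame, and a token-level embedding is emitted each time the accumulated weight crosses a threshold beta. The names (cif_fire, beta=1.0) are illustrative assumptions; the point is that the boundary of each token depends on the previous one, so the loop is inherently sequential, which is what the proposed parallel monotonic alignment avoids.

```python
import numpy as np

def cif_fire(h, alpha, beta=1.0):
    """Sequential CIF-style integration.

    h:     (T, D) frame-level embeddings
    alpha: (T,)   non-negative per-frame weights
    Returns (N, D) token-level embeddings, one per firing.
    """
    tokens = []
    acc_w = 0.0                    # weight accumulated since the last firing
    acc_e = np.zeros(h.shape[1])   # weighted embedding accumulated so far
    for t in range(h.shape[0]):
        w = alpha[t]
        if acc_w + w < beta:
            # No boundary in this frame: keep integrating.
            acc_w += w
            acc_e += w * h[t]
        else:
            # Boundary falls inside this frame: spend part of its weight to
            # fire a token, carry the remainder into the next accumulation.
            used = beta - acc_w
            tokens.append(acc_e + used * h[t])
            acc_w = w - used
            acc_e = acc_w * h[t]
    return np.stack(tokens) if tokens else np.zeros((0, h.shape[1]))
```

Because each iteration depends on the accumulator state left by the previous one, the firing positions cannot be computed for all frames at once; a mechanism that derives token boundaries in a single parallel pass removes this per-frame recursion, which is the efficiency gain the abstract claims.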