yang24j@interspeech_2024@ISCA

Contextual Biasing with Confidence-based Homophone Detector for Mandarin End-to-End Speech Recognition

Authors: Chengxu Yang, Lin Zheng, Sanli Tian, Gaofeng Cheng, Sujie Xiao, Ta Li

Deep biasing and shallow fusion methods have been shown to effectively improve end-to-end ASR performance. However, accurate recognition remains challenging when specific words within the contextual phrases occur too rarely in the training corpus or are out-of-vocabulary. To address this issue, we introduce a confidence-based homophone detector and a syllable bias model to correct context phrases that may have been recognized incorrectly. The detector exploits the peaks in the confidence distribution that homophone substitutions produce in ASR decoding outputs, and uses their coefficient of variation for discrimination to avoid degrading general performance. Experiments on the biased word subset of Aishell-1 show that our proposed method obtains a 31.2% relative CER improvement over the baseline and a relative decrease of 52.0% for context phrases. When cascaded with the deep fusion and shallow fusion methods, the improvements become 13.7% and 33.5%, respectively.
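The coefficient of variation mentioned in the abstract is the standard deviation of a set of values divided by their mean. The following is a minimal Python sketch of that statistic applied to per-token confidence scores. It is an illustration only, not the authors' detector: the abstract does not say which confidence values the statistic is computed over, in which direction the decision goes, or what threshold is used. The function names, the `threshold` value of 0.3, and the sample scores are all hypothetical.

```python
import statistics


def coefficient_of_variation(confidences):
    """Return the population standard deviation divided by the mean."""
    mean = statistics.fmean(confidences)
    if mean == 0:
        raise ValueError("mean confidence is zero; the ratio is undefined")
    return statistics.pstdev(confidences) / mean


def is_candidate_for_correction(confidences, threshold=0.3):
    """Flag a span whose confidences vary strongly relative to their mean.

    Assumption for illustration: a high coefficient of variation marks a
    span worth re-examining. Both the direction of the comparison and the
    threshold are hypothetical, not taken from the paper.
    """
    return coefficient_of_variation(confidences) > threshold


# Hypothetical per-token confidence scores for two recognized spans.
uniform_span = [0.9, 0.9, 0.9, 0.9]
uneven_span = [0.95, 0.2, 0.9, 0.92]

print(is_candidate_for_correction(uniform_span))
print(is_candidate_for_correction(uneven_span))
```

Because the statistic is normalized by the mean, the same threshold can be applied to spans with different overall confidence levels, which is one plausible reason for preferring it over a raw variance.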