#1 Dual Acoustic Linguistic Self-supervised Representation Learning for Cross-Domain Speech Recognition

Authors: Zhao Yang; Dianwen Ng; Chong Zhang; Xiao Fu; Rui Jiang; Wei Xi; Yukun Ma; Chongjia Ni; Eng Siong Chng; Bin Ma; Jizhong Zhao

The integration of well-pretrained acoustic and linguistic representations boosts the performance of speech-to-text cross-modality tasks. However, the potential of fine-tuning such cross-modality integrated models on accented and noisy corpora remains under-explored. To address this gap, we propose an end-to-end acoustic and linguistic integrated representation learning model, Dual-w2v-BART. Our model incorporates acoustic representations from wav2vec 2.0 and linguistic information from BART through the cross-attention mechanism in the decoder, taking paired speech-text dual inputs. To enhance robustness to accent and noise, we further propose a text-centric representation consistency component that encourages the representations of the two modality inputs to agree when they express the same content. Results on accented and noisy speech recognition tasks demonstrate that the proposed model reduces error rates compared to baseline and other competitive models.
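
To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (a) a decoder layer that cross-attends to both wav2vec 2.0 acoustic states and BART linguistic states, and (b) a text-centric consistency loss that pulls the speech-side representation toward the paired text-side representation. This is an illustration under assumptions, not the authors' implementation: all module names, dimensions, the layer ordering, and the cosine-distance form of the consistency loss are hypothetical.

```python
# Hypothetical sketch of the Dual-w2v-BART decoder idea; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualCrossAttentionDecoderLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Separate cross-attention blocks, one per modality (assumed design).
        self.acoustic_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.linguistic_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, tgt, acoustic_states, linguistic_states):
        # Self-attention over the partial transcription (causal mask omitted for brevity).
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt)[0])
        # Cross-attend to wav2vec 2.0 acoustic encoder outputs.
        x = self.norms[1](x + self.acoustic_attn(x, acoustic_states, acoustic_states)[0])
        # Cross-attend to BART linguistic encoder outputs.
        x = self.norms[2](x + self.linguistic_attn(x, linguistic_states, linguistic_states)[0])
        return self.norms[3](x + self.ffn(x))

def text_centric_consistency_loss(speech_repr, text_repr):
    # Encourage the representation of the (accented/noisy) speech input to match
    # the representation of the paired text expressing the same content.
    # Detaching the text branch makes text the fixed anchor ("text-centric").
    speech_vec = F.normalize(speech_repr.mean(dim=1), dim=-1)
    text_vec = F.normalize(text_repr.mean(dim=1), dim=-1).detach()
    return (1.0 - (speech_vec * text_vec).sum(dim=-1)).mean()
```

In this reading, the consistency term would be added to the usual sequence-to-sequence training loss, so that representations of noisy or accented speech are regularized toward the cleaner text-derived ones; the exact weighting and pooling are further assumptions.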