misra21@interspeech_2021@ISCA

Total: 1

#1 A Comparison of Supervised and Unsupervised Pre-Training of End-to-End Models

Authors: Ananya Misra, Dongseong Hwang, Zhouyuan Huo, Shefali Garg, Nikhil Siddhartha, Arun Narayanan, Khe Chai Sim

In the absence of large-scale in-domain supervised training data, ASR models can achieve reasonable performance through pre-training on additional data that is unlabeled, mismatched, or both. Given such data constraints, we compare pre-training end-to-end models on matched but unlabeled data (unsupervised) and on labeled but mismatched data (supervised), where the labeled data is mismatched in either domain or language. Across encoder architectures, pre-training methods, and languages, our experiments indicate that both types of pre-training improve performance, with relative WER reductions of 15–30% in the domain-mismatch case and up to 15% in the language-mismatch case. We further find that the advantage of unsupervised pre-training is most pronounced when no matched and labeled fine-tuning data is available, provided that a sufficient amount of mismatched data remains available for supervised fine-tuning.
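As a rough illustration of the comparison described in the abstract (not the authors' actual code), the two conditions can be viewed as two pre-train-then-fine-tune pipelines that differ only in the pre-training data and objective. All function and dataset names below are hypothetical placeholders.

```python
# Minimal sketch of the two experimental conditions; names are placeholders,
# not the paper's implementation.

def pretrain(model, data, labeled):
    """Pre-train the encoder: supervised ASR loss if labeled, self-supervised otherwise."""
    objective = "supervised_asr" if labeled else "self_supervised"
    print(f"pre-training {model} on {data} with {objective} objective")
    return model

def finetune(model, data):
    """Fine-tune the pre-trained model on whatever supervised data is available."""
    print(f"fine-tuning {model} on labeled {data}")
    return model

encoder = "end_to_end_encoder"  # placeholder for an end-to-end ASR encoder

# Condition A: unsupervised pre-training on matched but unlabeled audio.
model_a = finetune(pretrain(encoder, "matched_unlabeled_audio", labeled=False),
                   "available_supervised_data")

# Condition B: supervised pre-training on labeled but mismatched data
# (different domain or language), followed by the same fine-tuning step.
model_b = finetune(pretrain(encoder, "mismatched_labeled_data", labeled=True),
                   "available_supervised_data")
```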

Subject: INTERSPEECH.2021 - Speech Recognition