misra21@interspeech_2021@ISCA


#1 A Comparison of Supervised and Unsupervised Pre-Training of End-to-End Models

Authors: Ananya Misra; Dongseong Hwang; Zhouyuan Huo; Shefali Garg; Nikhil Siddhartha; Arun Narayanan; Khe Chai Sim

In the absence of large-scale in-domain supervised training data, ASR models can achieve reasonable performance through pre-training on additional data that is unlabeled, mismatched, or both. Given such data constraints, we compare pre-training end-to-end models on matched but unlabeled data (unsupervised) with pre-training on labeled but mismatched data (supervised), where the labeled data is mismatched in either domain or language. Across encoder architectures, pre-training methods, and languages, our experiments indicate that both types of pre-training improve performance, with relative WER reductions of 15–30% in the domain-mismatch case and up to 15% in the language-mismatch case. We further find that the advantage of unsupervised pre-training is most prominent when no matched, labeled fine-tuning data is available, provided that a sufficient amount of mismatched data is still available for supervised fine-tuning.
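
The abstract contrasts two pre-training regimes followed by the same supervised fine-tuning stage. The sketch below is a minimal, hypothetical illustration of that workflow, not the authors' setup: the encoder, the frame-level supervised objective, the masked-reconstruction unsupervised objective, and the random tensors standing in for the (mismatched labeled / matched unlabeled / matched labeled) corpora are all illustrative assumptions.

```python
# Hypothetical sketch of the comparison: (a) supervised pre-training on labeled
# but mismatched data vs. (b) unsupervised pre-training on matched but unlabeled
# data, each followed by supervised fine-tuning of the same encoder.
import torch
import torch.nn as nn

FEAT_DIM, HIDDEN, VOCAB = 80, 256, 32  # toy sizes, not the paper's configuration


class Encoder(nn.Module):
    """Toy acoustic encoder standing in for the paper's end-to-end encoders."""

    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(FEAT_DIM, HIDDEN, batch_first=True)

    def forward(self, feats):                      # feats: (B, T, FEAT_DIM)
        out, _ = self.rnn(feats)
        return out                                 # (B, T, HIDDEN)


def supervised_step(encoder, head, feats, targets, optimizer):
    """One supervised step (frame classification as a stand-in objective)."""
    logits = head(encoder(feats))                  # (B, T, VOCAB)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def unsupervised_step(encoder, decoder, feats, optimizer, mask_prob=0.3):
    """One unsupervised step: reconstruct masked input frames from context."""
    mask = (torch.rand(feats.shape[:2]) < mask_prob).unsqueeze(-1)
    recon = decoder(encoder(feats.masked_fill(mask, 0.0)))   # (B, T, FEAT_DIM)
    loss = ((recon - feats) ** 2)[mask.expand_as(feats)].mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    enc = Encoder()
    asr_head = nn.Linear(HIDDEN, VOCAB)            # used in pre-training (a) and fine-tuning
    recon_head = nn.Linear(HIDDEN, FEAT_DIM)       # used only in pre-training (b)
    opt = torch.optim.Adam(
        list(enc.parameters()) + list(asr_head.parameters())
        + list(recon_head.parameters()), lr=1e-3)

    # Random tensors stand in for batches from the different corpora.
    feats = torch.randn(4, 50, FEAT_DIM)
    labels = torch.randint(0, VOCAB, (4, 50))

    # Regime (a): supervised pre-training on mismatched labeled data ...
    supervised_step(enc, asr_head, feats, labels, opt)
    # ... or regime (b): unsupervised pre-training on matched unlabeled data.
    unsupervised_step(enc, recon_head, feats, opt)

    # Either way, the same encoder is then fine-tuned on whatever labeled data is available.
    supervised_step(enc, asr_head, feats, labels, opt)
```

The point of the sketch is only the shape of the comparison: both regimes produce an initialized encoder, and the fine-tuning stage is identical, so any WER difference can be attributed to the pre-training data and objective.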