cao24b@interspeech_2024@ISCA

A Framework for Phoneme-Level Pronunciation Assessment Using CTC

Authors: Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi

Traditional phoneme-level goodness of pronunciation (GOP) methods require phoneme-to-speech alignment. By definition, these methods are prone to alignment errors and cannot account for deletion and insertion errors in pronunciation. We produce experimental evidence that CTC-based methods can be used in traditional GOP estimation in spite of their “peaky” output behaviour, and that they may be less prone to alignment errors than traditional methods. We also propose a new framework for GOP estimation, based on a CTC-trained model, that is independent of speech-phoneme alignment. By accounting for deletion and insertion errors as well as substitution errors, we show that our framework outperforms alignment-based methods. Our experimental results are based on the CMU-kids dataset for child speech and on the Speechocean762 dataset, which contains both child and adult speakers. Our best method achieves a 29.02% relative improvement over the baseline GOP methods.
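To make the idea of alignment-free scoring concrete, the following is a minimal sketch and not the authors' formulation. It computes the CTC log-likelihood of a phoneme sequence with the standard forward algorithm, which sums over all frame-level alignments, and then scores one canonical phoneme against its substitution and deletion alternatives. The function names `ctc_log_likelihood` and `gop_like_score`, the log-ratio score, and the choice of alternatives are assumptions made for illustration; insertion errors are not modelled here.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) for a few values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_likelihood(log_probs, labels, blank=0):
    """log P(labels | frames) under CTC, summed over all alignments.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    labels: list of symbol indices (no blanks).
    """
    # Extended sequence with blanks interleaved: [b, l1, b, l2, ..., b]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S, T = len(ext), len(log_probs)

    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            cands = [alpha[s]]
            if s >= 1:
                cands.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = logsumexp(*cands) + log_probs[t][ext[s]]
        alpha = new

    if S == 1:
        return alpha[0]
    return logsumexp(alpha[-1], alpha[-2])


def gop_like_score(log_probs, canonical, i, n_symbols, blank=0):
    """Hypothetical score: log-likelihood of the canonical sequence minus
    that of the best alternative in which phoneme i is substituted or deleted.
    """
    canonical = list(canonical)
    ref = ctc_log_likelihood(log_probs, canonical, blank)
    alternatives = [canonical[:i] + canonical[i + 1:]]  # deletion
    for p in range(n_symbols):
        if p != blank and p != canonical[i]:
            alternatives.append(canonical[:i] + [p] + canonical[i + 1:])
    best_alt = max(ctc_log_likelihood(log_probs, a, blank) for a in alternatives)
    return ref - best_alt


# Toy usage: two frames, three symbols (0 = blank), output favouring symbol 1.
frame = [math.log(0.1), math.log(0.8), math.log(0.1)]
toy = [frame, frame]
score_correct = gop_like_score(toy, [1], 0, n_symbols=3)
score_wrong = gop_like_score(toy, [2], 0, n_symbols=3)
```

In this toy case the score is positive when the canonical phoneme matches what the model favours and negative when it does not. Because the likelihood sums over all alignments, no forced alignment step is needed.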