Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet

Authors: Anyu Ying, Natarajan Balaji Shankar, Chyi-Jiunn Lin, Mohan Shi, Pu Wang, Hye-jin Shim, Siddhant Arora, Hugo Van hamme, Abeer Alwan, Shinji Watanabe

Despite advancements in ASR, child speech recognition remains challenging due to acoustic variability and limited annotated data. While fine-tuning adult ASR models on child speech is common, comparisons with flat-start training remain underexplored. We compare flat-start training across multiple datasets, SSL representations (WavLM, XEUS), and decoder architectures. Our results show that SSL representations are biased toward adult speech, with flat-start training on child speech mitigating these biases. We also analyze model scaling, finding consistent improvements up to 1B parameters, beyond which performance plateaus. Additionally, age-related ASR and speaker verification analysis highlights the limitations of proprietary models like Whisper, emphasizing the need for open-data models for reliable child speech research. All investigations are conducted using ESPnet, and our publicly available benchmark provides insights into training strategies for robust child speech processing.

Subject: Machine Learning

Publish: 2025-08-22 17:59:35 UTC

arXiv: 2508.16576
