getman25@interspeech_2025@ISCA


#1 Is your model big enough? Training and interpreting large-scale monolingual speech foundation models

Authors: Yaroslav Getman, Tamás Grósz, Tommi Lehtonen, Mikko Kurimo

Self-supervised learning has been widely used to develop speech foundation models. Most languages, however, are represented only in multilingual foundation models. We introduce monolingual self-supervised foundation models pre-trained on more than 150,000 hours of Finnish speech and propose a new interpretation technique to understand their capabilities. To our knowledge, this is the largest monolingual dataset used for self-supervised non-English speech representation learning. Our models demonstrate superior downstream low-resource ASR performance and improved generalization compared to prior work, with absolute WER reductions of up to 14%. Moreover, our proposed interpretation technique, Layer Utilization Rate (LUR), estimates the percentage of neurons in each layer that contribute strongly to the output. Empirical results show that LUR can indicate how well a fine-tuned model's size and architecture will generalize to unseen domains.
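The abstract does not give the exact formula for LUR, only that it measures the percentage of neurons per layer that contribute strongly to the output. The sketch below is an illustrative approximation under that reading: it treats LUR as the fraction of neurons whose contribution score exceeds a fixed share of the layer's maximum score. The function name `layer_utilization_rate`, the input `layer_scores`, and the `threshold_ratio` parameter are hypothetical and not taken from the paper.

```python
# Illustrative sketch only -- not the paper's definition of LUR.
# Assumption: each layer is summarized by a 1-D array of per-neuron
# contribution scores (e.g. attributions or mean activations), and a
# neuron counts as "highly contributing" if its score exceeds a fixed
# fraction of the layer's maximum score.
import numpy as np

def layer_utilization_rate(layer_scores, threshold_ratio=0.5):
    """Return, for each layer, the fraction of neurons whose score
    exceeds `threshold_ratio` times that layer's maximum score."""
    rates = []
    for scores in layer_scores:  # one array of neuron scores per layer
        scores = np.abs(np.asarray(scores, dtype=float))
        cutoff = threshold_ratio * scores.max()
        rates.append(float((scores > cutoff).mean()))
    return rates

# Toy usage: three layers of 768 neurons with random contribution scores.
rng = np.random.default_rng(0)
toy_scores = [rng.random(768) for _ in range(3)]
print(layer_utilization_rate(toy_scores))
```

Under this reading, a layer with a low rate would have only a few dominant neurons, while a rate near 1.0 would mean most neurons contribute comparably; how such per-layer rates relate to generalization is the subject of the paper itself.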

Subject: INTERSPEECH.2025 - Modelling and Learning