#1 Investigating wav2vec2 context representations and the effects of fine-tuning, a case-study of a Finnish model

Authors: Tamas Grosz; Yaroslav Getman; Ragheb Al-Ghezi; Aku Rouhe; Mikko Kurimo

Self-supervised speech models such as wav2vec2 have become extremely popular in the past few years. Their main appeal is that, after pre-training on a large amount of audio, they require only a small amount of supervised fine-tuning data to achieve outstanding results. Despite their immense success, very little is understood about the pre-trained models and how fine-tuning changes them. In this work, we take the first steps towards a better understanding of wav2vec2 systems using model interpretation tools such as visualization and latent embedding clustering. Through our analysis, we gain new insights into the abilities of the pre-trained networks and the effect that fine-tuning has on them. We demonstrate that the clusters learned by the pre-trained model are just as important a factor in determining the accuracy of the fine-tuned system as the distribution of the supervised training data, which could aid us in selecting the most suitable pre-trained model for the supervised data.
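To make the latent-embedding-clustering analysis mentioned in the abstract concrete, the following is a minimal sketch using the Hugging Face transformers implementation of wav2vec2. The checkpoint name, the random stand-in audio, and the cluster count are illustrative assumptions, not details taken from the paper (the authors' Finnish model is not reproduced here).

import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Illustrative checkpoint only; the paper studies a Finnish wav2vec2 model.
model_name = "facebook/wav2vec2-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
model.eval()

# Stand-in for real speech: 10 s of random samples at 16 kHz.
waveform = np.random.randn(160000).astype(np.float32)
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # output_hidden_states=True exposes every transformer layer,
    # so representations can be inspected layer by layer.
    out = model(**inputs, output_hidden_states=True)

# Frame-level context representations from the final layer: (frames, dim).
embeddings = out.last_hidden_state.squeeze(0).numpy()

# Cluster the frame embeddings; 50 clusters is an arbitrary choice here.
labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(embeddings)
print(embeddings.shape, np.bincount(labels).max())

Running the same extraction on a pre-trained versus a fine-tuned checkpoint, and comparing the resulting cluster assignments (e.g., against phone labels), is one simple way to probe how fine-tuning reshapes the latent space.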