lin24g@interspeech_2024@ISCA

Total: 1

#1 SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech

Authors: Jingru Lin; Meng Ge; Junyi Ao; Liqun Deng; Haizhou Li

Pre-trained models based on self-supervised learning (SSL) have proven effective in various downstream speech tasks. However, most such models are trained on single-speaker speech data, limiting their effectiveness on mixture speech. This motivates us to explore pre-training on mixture speech. This work presents SA-WavLM, a novel pre-trained model for mixture speech. Specifically, SA-WavLM follows an "extract-merge-predict" pipeline, in which the representation of each speaker in the input mixture is first extracted individually and the representations are then merged before the final prediction. In this pipeline, SA-WavLM performs speaker-informed extractions that account for the interactions between different speakers. Furthermore, a speaker shuffling strategy is proposed to enhance robustness to speaker absence. Experiments show that SA-WavLM either matches or improves upon state-of-the-art pre-trained models.
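The "extract-merge-predict" pipeline above can be sketched in a few lines. This is a minimal illustrative skeleton, not the authors' SA-WavLM implementation: the function names, the additive speaker conditioning, the averaging merge, and the sign readout are all assumptions made purely to show the control flow (per-speaker extraction, merging, final prediction, and speaker shuffling).

```python
import random

def extract(mixture, speaker_emb):
    # Speaker-informed extraction (hypothetical): bias each mixture frame
    # toward one enrolled speaker via additive conditioning.
    return [frame + speaker_emb for frame in mixture]

def merge(per_speaker_reprs):
    # Merge the per-speaker representations frame by frame (here: averaging).
    n = len(per_speaker_reprs)
    return [sum(frames) / n for frames in zip(*per_speaker_reprs)]

def predict(merged):
    # Final prediction head (here: a trivial per-frame sign readout
    # standing in for the pre-training target prediction).
    return [1 if x > 0 else 0 for x in merged]

def extract_merge_predict(mixture, speaker_embs, shuffle=True):
    # Speaker shuffling (sketch): permute the enrolled speakers so the model
    # cannot rely on a fixed ordering, improving robustness when a speaker
    # is absent from the mixture.
    if shuffle:
        speaker_embs = random.sample(speaker_embs, len(speaker_embs))
    reprs = [extract(mixture, emb) for emb in speaker_embs]
    return predict(merge(reprs))
```

Because the merge step averages over speakers, the output here is invariant to the shuffling order, which is exactly the property the shuffling strategy encourages during training.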