OR-TSE: An Overlap-Robust Speaker Encoder for Target Speech Extraction

Authors: Yiru Zhang, Linyu Yao, Qun Yang

Mainstream Target Speech Extraction (TSE) systems extract target speech from a mixture using pre-enrolled reference speech. Extraction performance depends heavily on the quality of the reference speech. However, the speech signal of the same speaker may vary under different conditions, leading to a decrease in extraction performance, particularly in the presence of speech overlap. Therefore, we propose an overlap-robust speaker encoder for TSE that obtains stable speaker embeddings even from signals with overlapping interference. Our approach combines attentive statistics pooling with contrastive learning so that the model focuses on the features of the main speaker while disregarding interfering information. Based on the proposed speaker encoder, we introduce a TSE framework that derives speaker embeddings from non-overlapping regions of the mixture input. Experiments show that our speaker encoder improves TSE performance under different reference speech conditions.
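
The abstract names attentive statistics pooling as the mechanism that lets the encoder weight frames of the main speaker more heavily. The following is a minimal sketch of the generic technique only, not the authors' implementation: the function name `attentive_stats_pooling`, the parameter shapes, and the random toy inputs are all assumptions made for illustration, and the contrastive training objective is not shown.

```python
import math
import random


def attentive_stats_pooling(frames, W, b, v):
    """Pool T frame-level features (each of size D) into one 2*D vector.

    frames: list of T lists of length D (hypothetical frame features)
    W: A x D projection, b: length-A bias, v: length-A scoring vector
    Returns (embedding, weights), where embedding = weighted mean + weighted std.
    """
    # Scalar attention score per frame: v . tanh(W h + b)
    scores = []
    for h in frames:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
                  for row, bias in zip(W, b)]
        scores.append(sum(vi * zi for vi, zi in zip(v, hidden)))

    # Softmax over time (shifted by the max score for numerical stability)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted mean and standard deviation per feature dimension
    dim = len(frames[0])
    mean = [sum(a * h[d] for a, h in zip(weights, frames)) for d in range(dim)]
    second = [sum(a * h[d] ** 2 for a, h in zip(weights, frames)) for d in range(dim)]
    std = [math.sqrt(max(m2 - mu ** 2, 1e-8)) for m2, mu in zip(second, mean)]
    return mean + std, weights


# Toy usage with random values (assumed sizes; a real encoder learns W, b, v)
random.seed(0)
T, D, A = 20, 4, 3
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
W = [[random.gauss(0, 1) for _ in range(D)] for _ in range(A)]
b = [0.0] * A
v = [random.gauss(0, 1) for _ in range(A)]
embedding, weights = attentive_stats_pooling(frames, W, b, v)
```

Frames that receive low attention weights contribute little to either statistic, which is the property an overlap-robust encoder would rely on to down-weight frames dominated by an interfering speaker.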

Subject: INTERSPEECH.2024 - Speech Processing

zhang24p@interspeech_2024@ISCA
