Towards Explainable Monaural Speaker Separation with Auditory-based Training

Authors: Hassan Taherian, Vahid Ahmadi Kalkhorani, Ashutosh Pandey, Daniel Wong, Buye Xu, DeLiang Wang

Permutation ambiguity is a major challenge in training monaural talker-independent speaker separation. While permutation invariant training (PIT) is a widely used technique, it functions as a "black box", providing little insight into which auditory cues lead to successful training. We introduce a new approach to speaker separation that leverages differences in pitch and onset, both of which are prominent cues for auditory scene analysis. We propose pitch-based and onset-based training to resolve permutation ambiguity, assigning speakers to outputs by their pitch frequencies and onset times, respectively. This approach offers a more explainable training strategy than PIT. We also propose a hybrid criterion that combines these cues to improve separation performance in challenging conditions such as same-gender speakers or close utterance onsets. Evaluation results show that the pitch and onset criteria each perform competitively with PIT, and that the hybrid criterion surpasses PIT in separating two-speaker mixtures.
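
The contrast between PIT and a cue-based assignment can be illustrated with a minimal sketch. It is not the authors' implementation: the mean-squared-error loss, the function names, and the per-speaker `mean_pitch_hz` and `onset_s` values are all assumptions made for illustration. The abstract also does not say how the hybrid criterion combines the two cues, so the threshold rule in `hybrid_loss` (and its `min_pitch_gap_hz` parameter) is purely a hypothetical stand-in.

```python
from itertools import permutations

import numpy as np


def mse(est, ref):
    """Mean squared error between one estimate and one reference signal."""
    return float(np.mean((est - ref) ** 2))


def pit_loss(ests, refs):
    """PIT: take the minimum loss over all output-to-speaker assignments."""
    n = len(refs)
    return min(
        sum(mse(ests[i], refs[p[i]]) for i in range(n)) / n
        for p in permutations(range(n))
    )


def ordered_loss(ests, refs, keys):
    """Fixed assignment: output i is trained on the speaker with the i-th smallest key."""
    order = np.argsort(keys, kind="stable")
    return sum(mse(ests[i], refs[j]) for i, j in enumerate(order)) / len(refs)


def pitch_based_loss(ests, refs, mean_pitch_hz):
    """Assumed convention: output 0 is the lower-pitched speaker."""
    return ordered_loss(ests, refs, mean_pitch_hz)


def onset_based_loss(ests, refs, onset_s):
    """Assumed convention: output 0 is the speaker who starts first."""
    return ordered_loss(ests, refs, onset_s)


def hybrid_loss(ests, refs, mean_pitch_hz, onset_s, min_pitch_gap_hz=20.0):
    """Hypothetical combination rule (not taken from the paper):
    order by pitch when the pitches are well separated, otherwise by onset."""
    gap = max(mean_pitch_hz) - min(mean_pitch_hz)
    if gap >= min_pitch_gap_hz:
        return pitch_based_loss(ests, refs, mean_pitch_hz)
    return onset_based_loss(ests, refs, onset_s)
```

Unlike `pit_loss`, which searches over assignments and reveals nothing about why one was chosen, the ordered variants fix the assignment from a measurable property of each speaker, which is the sense in which the abstract calls the strategy more explainable.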

Subject: INTERSPEECH.2024 - Speech Processing

taherian24@interspeech_2024@ISCA
