Unsupervised Single-Channel Speech Separation with a Diffusion Prior under Speaker-Embedding Guidance


Authors: Runwu Shi, Kai Li, Chang Li, Jiang Wang, Sihan Tan, Kazuhiro Nakadai

Speech separation is a fundamental task in audio processing, typically addressed with fully supervised systems trained on paired mixtures. While effective, such systems usually rely on synthetic data pipelines, which may not reflect real-world conditions. Instead, we revisit the source-model paradigm, training a diffusion generative model solely on anechoic speech and formulating separation as a diffusion inverse problem. However, unconditional diffusion models lack speaker-level conditioning: they can capture local acoustic structure but produce temporally inconsistent speaker identities in the separated sources. To address this limitation, we propose speaker-embedding guidance that, during the reverse diffusion process, maintains speaker coherence within each separated track while driving the embeddings of different speakers further apart. In addition, we propose a new separation-oriented solver. Extensive experimental results confirm that both strategies enhance performance on the challenging task of unsupervised source-model-based speech separation. Audio samples and code are available at https://runwushi.github.io/UnSepDiff_demo.
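
The two goals of the guidance described in the abstract (coherence within a track, distance between tracks) can be illustrated with a toy objective. This is only a sketch under stated assumptions, not the paper's loss: the function name `speaker_guidance_loss`, the cosine-based terms, the `weight_inter` parameter, and the use of per-segment embeddings of shape (tracks, segments, dim) are all hypothetical, and the speaker-embedding model that would produce those vectors is not shown. In a guided sampler, the gradient of a quantity like this with respect to the current source estimates could steer each reverse step; that wiring is omitted here.

```python
import numpy as np


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def speaker_guidance_loss(seg_emb, weight_inter=1.0):
    """Toy guidance objective (hypothetical, not the paper's formulation).

    seg_emb: array of shape (n_tracks, n_segments, dim) holding one speaker
    embedding per time segment of each separated track; needs n_tracks >= 2.
    Lower is better: segments of a track agree, and tracks differ.
    """
    seg_emb = np.asarray(seg_emb, dtype=float)
    n_tracks = seg_emb.shape[0]
    centroids = seg_emb.mean(axis=1)  # one mean embedding per track

    # Intra-track term: each segment should point the same way as its
    # track's centroid (0 when perfectly coherent).
    intra = np.mean([
        1.0 - cosine(e, centroids[k])
        for k in range(n_tracks)
        for e in seg_emb[k]
    ])

    # Inter-track term: centroids of different tracks should be dissimilar.
    inter = np.mean([
        cosine(centroids[i], centroids[j])
        for i in range(n_tracks)
        for j in range(i + 1, n_tracks)
    ])

    return float(intra + weight_inter * inter)


# Two tracks, two segments each, 2-D embeddings for readability.
consistent = np.array([[[1.0, 0.0], [1.0, 0.0]],
                       [[0.0, 1.0], [0.0, 1.0]]])
swapped = np.array([[[1.0, 0.0], [0.0, 1.0]],
                    [[0.0, 1.0], [1.0, 0.0]]])

print(speaker_guidance_loss(consistent))  # close to zero
print(speaker_guidance_loss(swapped))     # clearly larger
```

The `swapped` case mimics the failure mode the abstract attributes to an unconditional prior: each track changes speaker partway through, so both centroids collapse toward the same direction and both terms of the objective grow.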

Subjects: Audio and Speech Processing, Sound

Publish: 2025-09-29 07:42:54 UTC

2509.24395
