Diff-VS: Efficient Audio-Aware Diffusion U-Net for Vocals Separation

Authors: Yun-Ning Hung, Richard Vogl, Filip Korzeniowski, Igor Pereira

While diffusion models are best known for their performance in generative tasks, they have also been successfully applied to many other tasks, including audio source separation. However, current generative approaches to music source separation often underperform on standard objective metrics. In this paper, we address this issue by introducing a novel generative vocal separation model based on the Elucidated Diffusion Model (EDM) framework. Our model processes complex short-time Fourier transform spectrograms and employs an improved U-Net architecture based on music-informed design choices. Our approach matches discriminative baselines on objective metrics and achieves perceptual quality comparable to state-of-the-art systems, as assessed by proxy subjective metrics. We hope these results encourage broader exploration of generative methods for music source separation.
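
For readers unfamiliar with the EDM framework mentioned above, the following is a minimal sketch of the standard EDM preconditioning, in which a raw network F is wrapped into a denoiser D(x; sigma) = c_skip * x + c_out * F(c_in * x, c_noise). It illustrates the generic framework only and is not the Diff-VS implementation: the names `edm_preconditioning`, `denoise`, and `raw_network` are hypothetical, `SIGMA_DATA = 0.5` is the default from the original EDM work and is assumed here, and a separation model would presumably also condition the network on the mixture spectrogram, which this sketch omits.

```python
import math

# Assumed data standard deviation (EDM default); not confirmed for Diff-VS.
SIGMA_DATA = 0.5


def edm_preconditioning(sigma, sigma_data=SIGMA_DATA):
    """Return (c_skip, c_out, c_in, c_noise) for a noise level sigma > 0."""
    total = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / total                 # weight of the noisy input
    c_out = sigma * sigma_data / math.sqrt(total)    # scale of the network output
    c_in = 1.0 / math.sqrt(total)                    # scale of the network input
    c_noise = math.log(sigma) / 4.0                  # noise-level conditioning value
    return c_skip, c_out, c_in, c_noise


def denoise(raw_network, noisy, sigma, sigma_data=SIGMA_DATA):
    """Apply D(x; sigma) = c_skip * x + c_out * F(c_in * x, c_noise).

    `noisy` is a flat list of floats standing in for a spectrogram;
    `raw_network` is any callable taking (scaled_input, c_noise).
    """
    c_skip, c_out, c_in, c_noise = edm_preconditioning(sigma, sigma_data)
    scaled = [c_in * value for value in noisy]
    residual = raw_network(scaled, c_noise)
    return [c_skip * x + c_out * r for x, r in zip(noisy, residual)]
```

With a placeholder network that returns zeros, `denoise` reduces to scaling the input by c_skip, which makes the skip connection easy to check in isolation.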

Subject: Audio and Speech Processing

Publish: 2026-04-01 16:44:19 UTC

2604.01120
