Disentangled speech representation learning for speaker verification aims to separate spoken content and speaker timbre into distinct representations. However, existing variational autoencoder (VAE)-based methods for speech disentanglement rely on latent variables that lack semantic meaning, limiting their effectiveness for speaker verification. To address this limitation, we propose a diffusion-based method that disentangles speaker features from speech content in the latent space. Building on the VAE framework, we employ a speaker encoder to learn an utterance-level latent variable representing speaker features, while frame-level latent variables capture the spoken content. Unlike previous sequential VAE approaches, our method applies a conditional diffusion model in the latent space to derive speaker-aware representations. Experiments on the VoxCeleb datasets demonstrate that our method effectively isolates speaker features from speech content using pre-trained speech representations.
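
To make the described architecture concrete, the following is a minimal, hypothetical PyTorch sketch of the setup the abstract outlines: an utterance-level speaker encoder, a VAE-style content encoder producing frame-level latents, and a denoiser that predicts the noise added to those latents conditioned on the speaker embedding. All module names, dimensions, and the standard DDPM epsilon-prediction objective here are assumptions for illustration, not the authors' implementation.

```python
# Sketch only: module names, sizes, and the DDPM objective are assumptions.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Pools frame features into one utterance-level speaker latent."""
    def __init__(self, feat_dim=80, spk_dim=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                  nn.Linear(256, spk_dim))

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        return self.proj(feats).mean(dim=1)   # temporal mean pooling -> (B, spk_dim)

class ContentEncoder(nn.Module):
    """VAE-style encoder producing frame-level content latents."""
    def __init__(self, feat_dim=80, z_dim=64):
        super().__init__()
        self.mu = nn.Linear(feat_dim, z_dim)
        self.logvar = nn.Linear(feat_dim, z_dim)

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        mu, logvar = self.mu(feats), self.logvar(feats)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return z, mu, logvar                  # z: (B, T, z_dim)

class CondDenoiser(nn.Module):
    """Predicts the noise added to content latents, conditioned on the speaker."""
    def __init__(self, z_dim=64, spk_dim=128, t_dim=32, n_steps=1000):
        super().__init__()
        self.t_embed = nn.Embedding(n_steps, t_dim)   # discrete timesteps
        self.net = nn.Sequential(
            nn.Linear(z_dim + spk_dim + t_dim, 256), nn.SiLU(),
            nn.Linear(256, z_dim))

    def forward(self, z_t, spk, t):           # z_t: (B, T, z_dim), t: (B,)
        B, T, _ = z_t.shape
        cond = torch.cat([spk, self.t_embed(t)], dim=-1)  # (B, spk_dim + t_dim)
        cond = cond.unsqueeze(1).expand(B, T, -1)         # broadcast over frames
        return self.net(torch.cat([z_t, cond], dim=-1))

def diffusion_loss(denoiser, z0, spk, alphas_cumprod):
    """Epsilon-prediction objective on the latent sequence."""
    B = z0.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (B,))
    a = alphas_cumprod[t].view(B, 1, 1)
    eps = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps   # forward noising step
    return nn.functional.mse_loss(denoiser(z_t, spk, t), eps)

# Usage sketch: linear noise schedule and one training step on random features.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

feats = torch.randn(4, 200, 80)               # (batch, frames, mel bins)
spk = SpeakerEncoder()(feats)
z0, mu, logvar = ContentEncoder()(feats)
loss = diffusion_loss(CondDenoiser(), z0, spk, alphas_cumprod)
loss.backward()
```

In this reading, the speaker embedding enters only as conditioning on the denoiser, so the diffusion prior over frame-level content latents stays separate from any particular speaker, which is one plausible way the latent-space disentanglement described above could be realized.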