SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment

Authors: Shivam Mehta, Yingru Liu, Zhenyu Tang, Kainan Peng, Vimal Manohar, Shun Zhang, Mike Seltzer, Qing He, Mingbo Ma

Zero-shot voice conversion (VC) synthesizes speech in a target speaker's voice while preserving linguistic and paralinguistic content. However, timbre leakage, where source speaker traits persist, remains a challenge, especially in neural codec and LLM-based VC, where quantized representations entangle speaker identity with content. We introduce SemAlignVC, an architecture designed to prevent timbre leakage using SemAlign, a novel method that aligns text and audio representations to ensure speaker-independent semantic encoding. This disentangled representation conditions an autoregressive transformer for high-fidelity conversion without explicit speaker embeddings. Experiments show SemAlignVC significantly reduces timbre leakage, outperforming baselines in speaker timbre similarity, intelligibility, and naturalness, making it a robust, privacy-preserving, and generalizable VC solution. Audio samples can be accessed at https://shivammehta25.github.io/SemAlignVC/

Subjects: Audio and Speech Processing, Sound

Publish: 2025-07-11 23:14:07 UTC

2507.09070
