DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

Authors: Ziqian Ning, Shuai Wang, Pengcheng Zhu, Zhichao Wang, Jixun Yao, Lei Xie, Mengxiao Bi

Streaming voice conversion has gained popularity for its applicability in real-time applications. The recently proposed DualVC 2 achieves robust, high-quality streaming voice conversion with a latency of approximately 180 ms. However, DualVC 2 is based on the recognition-synthesis framework, with multi-level cascaded models that cannot be jointly optimized, and it suffers severe performance drops with small chunks, caused by the ASR encoder. To address these issues, we propose an end-to-end model, DualVC 3. It incorporates K-means clustered SSL features to guide the training of the content encoder and adopts an optional language model for pseudo-context generation to improve the conversion quality. Experimental results demonstrate that DualVC 3 achieves performance comparable to DualVC 2 in both subjective and objective metrics, with a latency of only 50 ms. We have made our audio samples publicly available.

Subject: INTERSPEECH.2024 - Speech Synthesis

ning24@interspeech_2024@ISCA
