ning24@interspeech_2024@ISCA

#1 DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

Authors: Ziqian Ning ; Shuai Wang ; Pengcheng Zhu ; Zhichao Wang ; Jixun Yao ; Lei Xie ; Mengxiao Bi

Streaming voice conversion has gained popularity for its applicability in real-time applications. The recently proposed DualVC 2 achieves robust, high-quality streaming voice conversion with a latency of approximately 180 ms. However, DualVC 2 is based on the recognition-synthesis framework: its multi-level cascaded models cannot be jointly optimized, and its ASR encoder causes severe performance drops when small chunks are used. To address these issues, we propose an end-to-end model, DualVC 3. It uses K-means clustered SSL features to guide the training of the content encoder and adopts an optional language model for pseudo-content generation to improve the conversion quality. Experimental results demonstrate that DualVC 3 achieves performance comparable to DualVC 2 in both subjective and objective metrics, with a latency of only 50 ms. We have made our audio samples publicly available.
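
As a rough illustration of what "K-means clustered SSL features" can mean in practice, the sketch below clusters frame-level feature vectors and maps each frame to the index of its nearest centroid, giving a discrete token sequence that could serve as a training target. This is a generic sketch, not the authors' implementation: the abstract does not state which SSL model is used, how many clusters there are, or how the tokens enter the loss. The function names (`kmeans`, `tokenize`) and the toy 2-D "features" are assumptions made for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame, centroids):
    """Index of the centroid closest to the given frame."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(frame, centroids[k]))


def kmeans(frames, num_clusters, num_iters=20, seed=0):
    """Plain Lloyd's algorithm; returns a list of centroid vectors."""
    rng = random.Random(seed)
    # Initialise centroids from randomly chosen frames.
    centroids = [list(f) for f in rng.sample(frames, num_clusters)]
    for _ in range(num_iters):
        buckets = [[] for _ in range(num_clusters)]
        for frame in frames:
            buckets[nearest(frame, centroids)].append(frame)
        for k, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def tokenize(frames, centroids):
    """Map each frame to the index of its nearest centroid."""
    return [nearest(frame, centroids) for frame in frames]


# Toy stand-in for SSL features (assumption): 100 two-dimensional frames
# drawn around two well-separated centres.
rng = random.Random(1)
frames = [[rng.gauss(c, 0.1), rng.gauss(c, 0.1)]
          for c in (0.0, 5.0) for _ in range(50)]

centroids = kmeans(frames, num_clusters=2)
tokens = tokenize(frames, centroids)
```

In a real system the frames would be high-dimensional outputs of a pretrained SSL model and the number of clusters would be much larger. The cluster indices are discrete and can be computed offline, which is one plausible reason such tokens are convenient as a guidance signal for a content encoder.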