Streaming voice conversion has gained popularity because of its suitability for real-time applications. The recently proposed DualVC 2 achieves robust, high-quality streaming voice conversion with a latency of approximately 180 ms. However, DualVC 2 is built on the recognition-synthesis framework, whose multi-level cascaded models cannot be jointly optimized, and its performance drops severely with small chunks because of the ASR encoder. To address these issues, we propose DualVC 3, an end-to-end model. It uses K-means clustered SSL features to guide the training of the content encoder and adopts an optional language model for pseudo-content generation to improve conversion quality. Experimental results demonstrate that DualVC 3 achieves performance comparable to DualVC 2 in both subjective and objective metrics, with a latency of only 50 ms. We have made our audio samples publicly available.
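
To illustrate the K-means clustering step mentioned above, the following minimal sketch quantizes frame-level features into discrete unit indices of the kind that could serve as guidance targets for a content encoder. It is not the DualVC 3 implementation: the random vectors stand in for real SSL features, and the function names (`kmeans_fit`, `assign_units`), the feature dimension, and the cluster count are assumptions chosen only for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_units(features, centroids):
    """Map each frame to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(f, centroids[k]))
        for f in features
    ]


def kmeans_fit(features, num_clusters, num_iters=10, seed=0):
    """Plain Lloyd's algorithm; returns a list of centroid vectors."""
    rng = random.Random(seed)
    # Initialize centroids from randomly chosen frames.
    centroids = [list(f) for f in rng.sample(features, num_clusters)]
    for _ in range(num_iters):
        labels = assign_units(features, centroids)
        for k in range(num_clusters):
            members = [f for f, lab in zip(features, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Hypothetical stand-in for SSL features: 100 frames of dimension 8.
rng = random.Random(0)
ssl_features = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(100)]

centroids = kmeans_fit(ssl_features, num_clusters=4)
units = assign_units(ssl_features, centroids)  # one discrete unit per frame
```

In this sketch, `units` holds one cluster index per frame. With real SSL features, such indices would form a discrete content representation; how DualVC 3 uses them during training is not specified in this abstract.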