Zero-Shot Voice Conversion via Content-Aware Timbre Ensemble and Conditional Flow Matching

Authors: Yu Pan, Yuguang Yang, Jixun Yao, Lei Ma, Jianjun Zhao

Despite recent advances in zero-shot voice conversion (VC), achieving speaker similarity and naturalness comparable to ground-truth recordings remains a significant challenge. In this letter, we propose CTEFM-VC, a zero-shot VC framework that integrates content-aware timbre ensemble modeling with conditional flow matching. Specifically, CTEFM-VC decouples utterances into content and timbre representations and uses a conditional flow matching model to reconstruct the Mel-spectrogram of the source speech. To improve timbre modeling and the naturalness of the generated speech, we first introduce a content-aware timbre ensemble modeling approach that adaptively integrates diverse speaker verification embeddings and, through a cross-attention module, makes effective use of source content and target timbre elements. Furthermore, a structural similarity-based timbre loss is presented to jointly train CTEFM-VC end-to-end. Experiments show that CTEFM-VC consistently achieves the best performance on all metrics assessing speaker similarity, speech naturalness, and intelligibility, significantly outperforming state-of-the-art zero-shot VC systems.
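As a rough illustration of the conditional flow matching objective mentioned in the abstract, the sketch below builds one training pair on the commonly used straight-line (optimal-transport) probability path and scores a velocity predictor against it. The abstract does not state which path or noise scale CTEFM-VC uses, so the path, `SIGMA_MIN`, and the names `cfm_pair`, `cfm_loss`, `predict_velocity`, and `cond` are all assumptions made for illustration; `cond` stands in for whatever content and timbre conditioning the model receives.

```python
import random

SIGMA_MIN = 1e-4  # assumed terminal noise scale; not taken from the abstract


def cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t) on the straight-line path from noise x0 to data x1.

    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u_t = d x_t / dt = x1 - (1 - sigma_min) * x0
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(predict_velocity, x1, cond, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity(x_t, t, cond) is a hypothetical stand-in for the network;
    x1 is a flattened Mel-spectrogram frame (a plain list of floats here).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # sample Gaussian noise
    t = rng.random()                        # sample a time in [0, 1)
    x_t, u_t = cfm_pair(x0, x1, t)
    v = predict_velocity(x_t, t, cond)
    return sum((p - q) ** 2 for p, q in zip(v, u_t)) / len(x1)


if __name__ == "__main__":
    rng = random.Random(0)
    mel_frame = [0.5, 0.25, -1.0]             # toy target values
    zero_model = lambda x_t, t, cond: [0.0] * len(x_t)
    print(cfm_loss(zero_model, mel_frame, cond=None, rng=rng))
```

At inference time, a model trained this way would be integrated from noise at t = 0 to t = 1 with an ODE solver to produce the Mel-spectrogram; that step is omitted here.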

Subjects: Sound, Artificial Intelligence, Audio and Speech Processing

Publish: 2024-11-04 12:23:17 UTC

2411.02026
