shang21@interspeech_2021@ISCA


#1 Incorporating Cross-Speaker Style Transfer for Multi-Language Text-to-Speech

Authors: Zengqiang Shang, Zhihua Huang, Haozhe Zhang, Pengyuan Zhang, Yonghong Yan

Recently, multilingual TTS systems trained only on monolingual datasets have achieved significant improvements. However, the quality of cross-language speech synthesis is still not comparable to synthesis in the speaker's own language and often comes with a heavy foreign accent. This paper proposes a multi-speaker, multi-style, multi-language speech synthesis system (M3), which improves speech quality by introducing a fine-grained style encoder and overcomes the non-authentic accent problem through cross-speaker style transfer. To avoid leaking timbre information into the style encoder, we utilize a speaker-conditional variational encoder and conduct adversarial speaker training using a gradient reversal layer. We then build a Mixture Density Network (MDN) that maps text to the extracted style vectors for each speaker. At the inference stage, cross-language style transfer is achieved by assigning the style of any speaker of the target language. Our system reuses existing speakers' styles and genuinely avoids foreign accents. In MOS speech-naturalness tests, the proposed method generally achieves 4.0 and significantly outperforms the baseline system.
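The MDN mentioned in the abstract predicts, from text, the parameters of a mixture distribution over style vectors; at inference a concrete style vector is then chosen from that mixture. A minimal sketch of this idea for a diagonal Gaussian mixture follows — all function and variable names are illustrative assumptions, not the paper's implementation, and the mixture parameters would in practice be produced by the trained network:

```python
import math

def mdn_log_likelihood(weights, means, sigmas, x):
    """Log-likelihood of a style vector x under a diagonal Gaussian mixture.

    weights: K mixture weights (summing to 1)
    means, sigmas: K per-dimension mean / std-dev lists
    x: candidate style vector
    (Illustrative sketch only; a real MDN predicts these parameters from text.)
    """
    total = 0.0
    for w, mu, sigma in zip(weights, means, sigmas):
        log_comp = math.log(w)
        for xi, mi, si in zip(x, mu, sigma):
            # Per-dimension Gaussian log-density, accumulated over dimensions.
            log_comp += -0.5 * math.log(2 * math.pi * si * si) \
                        - (xi - mi) ** 2 / (2 * si * si)
        total += math.exp(log_comp)
    return math.log(total)

def most_likely_style(weights, means):
    """Pick the mean of the highest-weight component as the style vector."""
    k = max(range(len(weights)), key=lambda i: weights[i])
    return means[k]
```

Training would minimize the negative of `mdn_log_likelihood` over style vectors extracted by the style encoder; at inference, `most_likely_style` (or sampling from the mixture) supplies a style vector for the chosen speaker in the target language.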