Recently, language model (LM)-based speech synthesis models have shown remarkable naturalness and powerful zero-shot capabilities. In this paradigm, discrete speech tokens play a critical role. Prior work has proposed using automatic speech recognition (ASR) tasks to enrich the semantic content of the tokens and strengthen their alignment with text. However, the byte-pair encoding (BPE) tokenizer commonly used in ASR produces markedly different text token sets for different languages, making it difficult to exploit information shared across languages. This paper proposes using the International Phonetic Alphabet (IPA) as the ASR training target to learn language-independent speech tokens. In addition, we propose a timbre converter for speaker disentanglement in the speech synthesis model. The proposed approach effectively improves speaker similarity and expressiveness in both multilingual and cross-lingual zero-shot speech synthesis.
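The contrast between per-language BPE vocabularies and a shared IPA target can be illustrated with a short sketch. The snippet below is not the paper's implementation; it assumes the open-source `phonemizer` package with an espeak-ng backend is installed, and the language codes and example sentences are placeholders chosen for illustration.

```python
# Minimal sketch (not the paper's pipeline): show how a grapheme-to-phoneme
# front end maps different languages into one shared IPA symbol inventory,
# in contrast to separate per-language BPE vocabularies.
# Assumes the `phonemizer` package and an espeak-ng backend are installed;
# language codes follow espeak-ng conventions.
from phonemizer import phonemize

sentences = {
    "en-us": "hello world",
    "fr-fr": "bonjour le monde",
}

# Convert each sentence to an IPA string.
ipa_targets = {
    lang: phonemize(text, language=lang, backend="espeak", strip=True)
    for lang, text in sentences.items()
}

# Both languages draw from the same IPA inventory, so a single ASR target
# vocabulary can be built instead of one BPE vocabulary per language.
shared_vocab = sorted(
    {ch for ipa in ipa_targets.values() for ch in ipa if not ch.isspace()}
)

for lang, ipa in ipa_targets.items():
    print(f"{lang}: {ipa}")
print("shared IPA symbol vocabulary size:", len(shared_vocab))
```

Because IPA symbols come from a single universal inventory, the ASR target vocabulary stays essentially fixed as languages are added, which is the language-independence property the proposed tokenizer aims to exploit.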