liu25o@interspeech_2025@ISCA

Total: 1

#1 LIST: Language-Independent Speech Token for Multilingual Speech Synthesis with Language Models

Authors: Chang Liu, Zhen-Hua Ling, Yu Gu

Recently, language model (LM)-based speech synthesis models have shown remarkable naturalness and powerful zero-shot capabilities. In this paradigm, discrete speech tokens play a critical role. Prior work has proposed using automatic speech recognition (ASR) tasks to enhance the semantic information in these tokens and their alignment with text. However, the byte-pair encoding (BPE) tokenizer commonly used in ASR produces significantly different text token sets across languages, making it difficult to exploit language-shared information. This paper proposes using the International Phonetic Alphabet (IPA) as the ASR training target to learn language-independent speech tokens. In addition, we propose a timbre converter for speaker disentanglement in the speech synthesis model. Our proposed approach effectively improves speaker similarity and expressiveness in both multilingual and cross-lingual zero-shot speech synthesis.
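To make the core idea concrete, the sketch below shows how multilingual text can be mapped to IPA strings so that an ASR objective shares a single phonetic symbol inventory across languages, in contrast to language-specific BPE targets. This is only an illustration of the IPA-target concept, not the authors' pipeline; it assumes the third-party `phonemizer` package with the espeak-ng backend, and the example sentences and language codes are chosen arbitrarily.

```python
# Illustrative only: IPA transcriptions as a language-shared ASR target set.
# Assumes `pip install phonemizer` and an espeak-ng installation.
from phonemizer import phonemize

sentences = {
    "en-us": "Speech tokens should capture what is said.",
    "fr-fr": "Les jetons de parole doivent capturer ce qui est dit.",
    "de":    "Sprach-Token sollten erfassen, was gesagt wird.",
}

for lang, text in sentences.items():
    # The IPA string would replace language-specific BPE text tokens as the
    # ASR target, so all languages draw from the same phonetic alphabet.
    ipa = phonemize(text, language=lang, backend="espeak", strip=True)
    print(f"{lang}: {ipa}")
```

Because every language is transcribed into the same IPA symbol set, an ASR loss defined on these targets can encourage the speech tokenizer to encode phonetic content that transfers across languages, which is the motivation stated in the abstract.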

Subject: INTERSPEECH.2025 - Speech Synthesis