Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora

#1 Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora [PDF] [Copy] [Kimi] [REL]

Authors: Jing Xu, Daxin Tan, Jiaqi Wang, Xiao Chen

While Large Language Models (LLMs) have shown potential in speech generation and recognition, their applications are mainly confined to monolingual scenarios, with limited explorations in code-switched (CS) contexts. In this paper, we propose a Code-Switched Large Language Model (CS-LLM) to enhance the code-switched text-to-speech synthesis (CS TTS) capability in LLMs with only monolingual corpora. Specifically, we begin by enhancing the multilingual speech processing ability of LLMs through multilingual speech recognition and synthesis tasks. Then, we develop an effective code-switched (CS) data construction strategy that splits and concatenates words from different monolingual speech corpora to equip LLMs with improved CS TTS ability. Experiments show that our approach outperforms baselines in CS TTS in terms of naturalness, speaker consistency and similarity even with limited data. Additionally, the constructed CS data further improves multilingual speech synthesis and recognition.

Subjects: Audio and Speech Processing , Computation and Language , Sound

Publish: 2024-09-17 08:11:07 UTC

2409.10969

#1 Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora [PDF] [Copy] [Kimi] [REL]