GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture

Authors: Yaodong Song, Hongjie Chen, Jie Lian, Yuxin Zhang, Guangmin Xia, Zehan Li, Genliang Zhao, Jian Kang, Yongxiang Li, Jie Li

While large language models (LLMs) have revolutionized text-to-speech (TTS) synthesis through discrete tokenization paradigms, current architectures exhibit fundamental tensions between three critical dimensions: 1) irreversible loss of acoustic characteristics caused by quantization of speech prompts; 2) stringent dependence on precisely aligned prompt speech-text pairs that limit real-world deployment; and 3) catastrophic forgetting of the LLM's native text comprehension during optimization for speech token generation. To address these challenges, we propose an LLM-based text-to-speech Generation approach Optimized via a novel dual-branch ArchiTecture (GOAT-TTS). Our framework introduces two key innovations: (1) The modality-alignment branch combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency; (2) The speech-generation branch employs modular fine-tuning on top-k layers of an LLM for speech token prediction while freezing the bottom-k layers to preserve foundational linguistic knowledge. Moreover, multi-token prediction is introduced to support real-time streaming TTS synthesis. Experimental results demonstrate that our GOAT-TTS achieves performance comparable to state-of-the-art TTS models while validating the efficacy of synthesized dialect speech data.
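The layer split described for the speech-generation branch can be sketched as a simple freezing plan. This is only an illustration, not the authors' implementation: the names `LayerPlan` and `plan_layer_freezing`, the layer count of 8, and the choice of 3 trainable layers are all hypothetical. It also assumes the frozen bottom layers and the fine-tuned top layers together cover the whole stack, which the abstract does not state explicitly.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class LayerPlan:
    index: int       # 0 is the layer closest to the input embeddings
    trainable: bool  # True if the layer would be updated for speech-token prediction


def plan_layer_freezing(num_layers: int, num_trainable_top: int) -> list[LayerPlan]:
    """Mark the top `num_trainable_top` layers trainable and freeze the rest.

    Hypothetical helper: assumes the two groups partition the layer stack.
    """
    if not 0 <= num_trainable_top <= num_layers:
        raise ValueError("num_trainable_top must be between 0 and num_layers")
    first_trainable = num_layers - num_trainable_top
    return [LayerPlan(i, i >= first_trainable) for i in range(num_layers)]


# Illustrative sizes only; the abstract gives no layer counts.
plan = plan_layer_freezing(num_layers=8, num_trainable_top=3)
print("frozen:", [p.index for p in plan if not p.trainable])
print("trainable:", [p.index for p in plan if p.trainable])
```

In a framework such as PyTorch, a plan like this would typically be applied by setting `requires_grad` to `False` on the parameters of the frozen layers, so that only the upper layers receive gradient updates.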

Subjects: Computation and Language, Sound, Audio and Speech Processing

Published: 2025-04-15 01:44:56 UTC

2504.12339
