2509.19001

Total: 1

#1 HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS [PDF3] [Copy] [Kimi1] [REL]

Authors: Sihang Nie, Xiaofen Xing, Jingyuan Xing, Baiji Liu, Xiangmin Xu

Large Language Model (LLM)-based Text-to-Speech (TTS) models have already reached a high degree of naturalness. However, the precision control of TTS inference is still challenging. Although instruction-based Text-to-Speech (Instruct-TTS) models are proposed, these models still lack fine-grained control due to the modality gap between single-level text instructions and multilevel speech tokens. To address this limitation, we propose HD-PPT, a framework that transforms speech synthesis into a structured, hierarchical task. To enable fine-grained control, we introduce a novel speech codec to extract distinct prompt-preference and content-preference tokens from the complex speech tokens, supervised by automatic speech recognition (ASR) and cross-lingual audio-text pre-training (CLAP) objectives. To bridge the modality gap of these tokens, we propose a hierarchical decoding strategy, where the LLM generates tokens in a structured order: first semantic, then fine-grained style, and finally complete acoustic representation. Extensive experiments demonstrate that this hierarchical paradigm significantly improves instruction adherence and achieves state-of-the-art naturalness, validating our approach for precise and controllable speech synthesis. Audio samples are available at https://xxh333.github.io/.

Subjects: Audio and Speech Processing , Sound

Publish: 2025-09-23 13:45:56 UTC