#1 TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling

Authors: Yuancheng Wang, Dekun Chen, Xueyao Zhang, Junan Zhang, Jiaqi Li, Zhizheng Wu

Speech tokenizers serve as foundational components for speech language models, yet current designs exhibit several limitations: (1) dependence on multi-layer residual vector quantization structures or high frame rates, (2) reliance on auxiliary pre-trained models for semantic distillation, and (3) requirements for complex two-stage training processes. In this work, we introduce the **T**ext-**a**ware **Di**ffusion Transformer Speech **Codec** (***TaDiCodec***), a novel approach designed to overcome these challenges. TaDiCodec performs end-to-end optimization of quantization and reconstruction through a diffusion autoencoder, while integrating text guidance into the diffusion decoder to enhance reconstruction quality and achieve optimal compression. TaDiCodec achieves an extremely low frame rate of **6.25 Hz** and a corresponding bitrate of **0.0875 kbps** with a **single-layer codebook** for 24 kHz speech, while maintaining superior performance on critical speech generation metrics such as word error rate (WER), speaker similarity (SIM), and speech quality (UTMOS). Notably, TaDiCodec employs a single-stage, end-to-end training paradigm, obviating the need for auxiliary pre-trained models. We also validate the compatibility of TaDiCodec in language-model-based zero-shot text-to-speech with both autoregressive modeling and masked generative modeling, demonstrating its effectiveness and efficiency for speech language modeling, as well as a significantly reduced *reconstruction-generation gap*. To facilitate reproducibility and further research, we release our code and pre-trained checkpoints at https://github.com/AmphionTeam/TaDiCodec. Audio samples are available at https://tadicodec.github.io/.
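As a sanity check on the stated figures, the frame rate and bitrate together pin down the bits carried by each token, and hence the implied codebook size. This sketch only re-derives those numbers; the codebook size is an inference from the abstract's figures, not a detail the abstract itself states.

```python
import math

# Figures stated in the abstract
frame_rate_hz = 6.25    # tokens per second of speech
bitrate_kbps = 0.0875   # single-layer codebook, 24 kHz speech

# Bits per token: total bits per second divided by tokens per second
bits_per_token = (bitrate_kbps * 1000) / frame_rate_hz  # 87.5 / 6.25 = 14.0

# Implied single-layer codebook size (inferred, not stated in the abstract)
codebook_size = 2 ** round(bits_per_token)  # 2^14 = 16384 entries

print(f"{bits_per_token:.1f} bits/token -> codebook of {codebook_size} entries")
```

At 6.25 Hz, a one-minute utterance is represented by only 375 tokens from a single codebook, which is what makes the tokenizer attractive for downstream language modeling.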

Subject: NeurIPS.2025 - Poster