Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Authors: Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Zhenhui Ye, Chen Zhang, Bai Jionghao, Xiaoda Yang, Jialong Zuo, Yu Zhang, Rui Liu, Xiang Yin, Zhou Zhao

While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces S-DiT, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to S-DiT to reduce the difficulty of alignment learning without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that S-DiT achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.
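
The abstract names a multi-condition classifier-free guidance strategy but does not give its formula. As a rough illustration only, the sketch below shows one common way to combine guidance from two conditions with independent scales. It assumes the two conditions are the text and a speaker prompt, and the function name, argument names, and the exact combination rule are hypothetical, not taken from the paper.

```python
from typing import List, Sequence


def multi_condition_cfg(
    uncond: Sequence[float],
    text_only: Sequence[float],
    text_and_speaker: Sequence[float],
    text_scale: float,
    speaker_scale: float,
) -> List[float]:
    """Blend three model predictions using two independent guidance scales.

    Assumed formulation (not confirmed by the abstract):
        out = uncond
              + text_scale    * (text_only - uncond)
              + speaker_scale * (text_and_speaker - text_only)
    """
    return [
        u + text_scale * (t - u) + speaker_scale * (ts - t)
        for u, t, ts in zip(uncond, text_only, text_and_speaker)
    ]


# Hypothetical per-dimension predictions from three forward passes.
uncond = [0.0, 0.0]
text_only = [1.0, 2.0]
text_and_speaker = [3.0, 2.0]

# Both scales at 1.0 recover the fully conditioned prediction.
baseline = multi_condition_cfg(uncond, text_only, text_and_speaker, 1.0, 1.0)

# A larger speaker scale pushes the output further along the speaker direction.
stronger = multi_condition_cfg(uncond, text_only, text_and_speaker, 1.0, 2.0)
```

Under this assumed rule, the second scale acts only on the difference that the speaker prompt adds on top of the text, which is one plausible way a single knob could adjust something like accent intensity without changing how strongly the text is followed.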

Subjects: Audio and Speech Processing, Machine Learning, Sound

Published: 2025-02-26 08:22:00 UTC

2502.18924
