VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

Authors: Chenpeng Du, Yiwei Guo, Hankun Wang, Yifan Yang, Zhikang Niu, Shuai Wang, Hui Zhang, Xie Chen, Kai Yu

Recent TTS models with a decoder-only Transformer architecture, such as SPEAR-TTS and VALL-E, achieve impressive naturalness and demonstrate zero-shot adaptation given a speech prompt. However, such decoder-only TTS models lack monotonic alignment constraints, which sometimes leads to hallucination issues such as mispronunciation, word skipping, and word repetition. To address this limitation, we propose VALL-T, a generative Transducer model that introduces shifting relative position embeddings for the input phoneme sequence, explicitly indicating the monotonic generation process while maintaining the decoder-only Transformer architecture. Consequently, VALL-T retains the capability of prompt-based zero-shot adaptation and demonstrates better robustness against hallucinations, with a relative reduction of 28.3% in word error rate. Furthermore, the controllability of alignment in VALL-T during decoding facilitates the use of untranscribed speech prompts, even in unknown languages. It also enables the synthesis of lengthy speech by utilizing an aligned context window.
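
The shifting relative position idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that the phoneme currently being synthesized gets relative position 0, and that a Transducer-style blank symbol advances the alignment by one phoneme. The function names `relative_positions` and `shift_trace` and the `<blank>` token string are hypothetical, chosen only for illustration.

```python
def relative_positions(num_phonemes: int, current: int) -> list[int]:
    """Relative position of each phoneme with respect to the current one.

    Assumption: the phoneme being synthesized sits at position 0,
    earlier phonemes are negative, later phonemes are positive.
    """
    if not 0 <= current < num_phonemes:
        raise ValueError("current must index into the phoneme sequence")
    return [i - current for i in range(num_phonemes)]


def shift_trace(num_phonemes: int, emissions: list[str],
                blank: str = "<blank>") -> list[list[int]]:
    """Record the relative positions used at each decoding step.

    Assumption: emitting the (hypothetical) blank symbol moves the
    alignment forward by one phoneme, which shifts every relative
    position by one. Any other token leaves the alignment unchanged.
    """
    current = 0
    trace = []
    for token in emissions:
        trace.append(relative_positions(num_phonemes, current))
        if token == blank:
            current += 1
            if current >= num_phonemes:
                break  # all phonemes consumed; decoding would stop here
    return trace


if __name__ == "__main__":
    # Three phonemes; two speech tokens interleaved with two blanks.
    for step in shift_trace(3, ["a", "<blank>", "b", "<blank>"]):
        print(step)
```

Because the position indices can only move forward, the alignment in this sketch is monotonic by construction, which is the property the abstract credits for reduced word skipping and repetition.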

Subjects: Audio and Speech Processing, Sound

Published: 2024-01-25 17:19:01 UTC

arXiv: 2401.14321
