seong24b@interspeech_2024@ISCA

Total: 1

#1 TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech

Authors: Donghyun Seong; Hoyoung Lee; Joon-Hyuk Chang

Expressive text-to-speech (TTS) aims to synthesize more human-like speech by incorporating diverse speech styles or emotions. While most expressive TTS models rely on reference speech to condition the style of the generated speech, they often fail to maintain consistent speech quality. To ensure consistent quality, we propose an expressive TTS model conditioned on a style representation extracted from the text itself. To implement this text-based style predictor, we design a style module incorporating residual vector quantization. Furthermore, the style representation is enhanced through style-to-text alignment and a mel decoder with style hierarchical layer normalization (SHLN). Our experimental findings demonstrate that the proposed model accurately estimates the style representation, enabling the generation of high-quality speech without the need for reference speech.
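The style module's central mechanism, residual vector quantization (RVQ), can be illustrated with a minimal sketch: each codebook stage quantizes the residual left over by the previous stage, and the summed codewords approximate the original style vector. The dimensions, codebook sizes, and NumPy implementation below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def quantize(x, codebook):
    """Return the codebook entry nearest to x (L2 distance)."""
    idx = np.argmin(np.linalg.norm(codebook - x, axis=1))
    return codebook[idx]

def rvq(x, codebooks):
    """Residual vector quantization: each stage encodes the residual
    left by the previous stage; the sum of codewords approximates x."""
    residual = x.copy()
    approx = np.zeros_like(x)
    for cb in codebooks:
        q = quantize(residual, cb)
        approx += q
        residual = residual - q
    return approx

# Toy example: a hypothetical 8-dim style vector, 4 stages of 16 codes each.
rng = np.random.default_rng(0)
dim, n_codes, n_stages = 8, 16, 4
codebooks = [rng.normal(size=(n_codes, dim)) for _ in range(n_stages)]
style = rng.normal(size=dim)
approx = rvq(style, codebooks)
error = np.linalg.norm(style - approx)  # stacking stages typically shrinks this
```

Stacking stages lets a small set of codebooks represent a combinatorially large space of style vectors, which is presumably why the paper uses RVQ rather than a single flat codebook.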