mondal25b@interspeech_2025@ISCA


#1 ExagTTS: An Approach Towards Controllable Word Stress Incorporated TTS for Exaggerated Synthesized Speech Aiding Second Language Learners

Authors: Anindita Mondal, Monica Surtani, Anil Kumar Vuppala, Parameswari Krishnamurthy, Chiranjeevi Yarra

Computer-Assisted Language Learning systems provide exaggerated speech at mispronounced word locations as feedback to L2 learners. Traditionally, expert speakers' recordings, which limit scalability, are used for this task, even though advances in text-to-speech (TTS) can generate native-like, natural-sounding speech. To address this, this work proposes two novel controllable strategies for scalable speech exaggeration. One strategy is direct speech exaggeration, which incorporates the proposed label-conditioned tokenization in GlowTTS. The other cascades a state-of-the-art TTS system into a WORLD vocoder with the proposed energy and duration modifications. A subset of the Tatoeba corpus, which we annotated with prominent words, is used for experimentation. Automatic and manual assessments reveal that the exaggerated speech quality from both the direct strategy and the cascaded strategy with duration modification is closer to that of the prominent words in the native speaker's speech.
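
The abstract does not spell out how the energy and duration modifications are applied, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes WORLD-style frame-level features (F0, a power spectral envelope, aperiodicity) and a word span given in frame indices, for example from a forced aligner. The function name `exaggerate_span` and the parameters `dur_factor` and `gain_db` are hypothetical.

```python
import numpy as np

def exaggerate_span(f0, sp, ap, start, end, dur_factor=1.5, gain_db=6.0):
    """Stretch and amplify frames [start, end) of WORLD-style features.

    Assumptions (not from the paper): f0 has shape (T,), sp and ap have
    shape (T, D), and sp is a power spectrum.
    """
    if not 0 <= start < end <= len(f0):
        raise ValueError("span must satisfy 0 <= start < end <= number of frames")
    span_len = end - start
    n_out = max(1, int(round(span_len * dur_factor)))
    # Duration: nearest-frame resampling maps each output frame of the
    # stretched span back to a source frame inside the original span.
    offsets = (np.arange(n_out) * span_len / n_out).astype(int)
    src = start + np.minimum(offsets, span_len - 1)
    idx = np.concatenate([np.arange(start), src, np.arange(end, len(f0))])
    f0_out, sp_out, ap_out = f0[idx], sp[idx], ap[idx]
    # Energy: a gain in dB scales a power spectrum by 10 ** (dB / 10).
    sp_out[start:start + n_out] *= 10.0 ** (gain_db / 10.0)
    return f0_out, sp_out, ap_out
```

With the `pyworld` package, such features could be obtained from `pyworld.wav2world` and resynthesized with `pyworld.synthesize`; the sketch itself depends only on NumPy so that the modification step can be inspected in isolation.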

Subject: INTERSPEECH.2025 - Speech Synthesis