rautenberg25@interspeech_2025@ISCA

#1 Synthesizing Speech with Selected Perceptual Voice Qualities – A Case Study with Creaky Voice

Authors: Frederik Rautenberg, Fritz Seebauer, Jana Wiechmann, Michael Kuhlmann, Petra Wagner, Reinhold Haeb-Umbach

The control of perceptual voice qualities in a text-to-speech (TTS) system is of interest for applications where unmanipulated and manipulated speech probes can serve to illustrate phonetic concepts that are otherwise difficult to grasp. Here, we show that a TTS system augmented with a global speaker attribute manipulation block based on normalizing flows is capable of correctly manipulating the non-persistent, localized quality of creaky voice, thus avoiding the need for a typically unreliable frame-wise creak predictor. Subjective listening tests confirm successful creak manipulation at a slightly reduced MOS compared to the original recording.

Subject: INTERSPEECH.2025 - Speech Synthesis