Controlling perceptual voice qualities in a text-to-speech (TTS) system is of interest for applications in which unmanipulated and manipulated speech samples serve to illustrate phonetic concepts that are otherwise difficult to grasp. Here, we show that a TTS system augmented with a global speaker attribute manipulation block based on normalizing flows can correctly manipulate the non-persistent, localized quality of creaky voice, thus avoiding the need for a frame-wise creak predictor, which is typically unreliable. Subjective listening tests confirm successful creak manipulation at a slightly reduced mean opinion score (MOS) compared to the original recording.