Speaking Rate Control of end-to-end TTS Models by Direct Manipulation of the Encoder's Output Embeddings

Authors: Martin Lenglet, Olivier Perrotin, Gérard Bailly

Now that neural Text-To-Speech models achieve high naturalness, the main focus of the field has gradually shifted to gaining more control over the expressiveness of synthetic voices. One lever for such control is the speaking rate, which has become harder for a human operator to adjust since the introduction of neural attention networks to model speech dynamics. While numerous models have reintroduced explicit duration control (e.g., FastSpeech2), they generally rely on additional tasks during training. In this paper, we show that an acoustic analysis of the internal embeddings delivered by the encoder of an unsupervised end-to-end Tacotron2 TTS model is enough to identify and control some acoustic parameters of interest. Specifically, we compare this speaking rate control with the duration control offered by a supervised FastSpeech2 model. Experimental results show that the control provided by embeddings reproduces a behaviour closer to natural speech data.

Subject: INTERSPEECH.2022 - Speech Synthesis

lenglet22@interspeech_2022@ISCA
