Utterance Selection for Optimizing Intelligibility of TTS Voices Trained on ASR Data

Authors: Erica Cooper, Xinyue Wang, Alison Chang, Yocheved Levitan, Julia Hirschberg

This paper describes experiments in training HMM-based text-to-speech (TTS) voices on data collected for Automatic Speech Recognition (ASR) training. We compare a number of filtering techniques designed to identify the best utterances from a noisy, multi-speaker corpus for training voices, to exclude speech containing noise and to include speech close in nature to more traditionally-collected TTS corpora. We also evaluate the use of automatic speech recognizers for intelligibility assessment in comparison with crowdsourcing methods. While the goal of this work is to develop natural-sounding and intelligible TTS voices in Low Resource Languages (LRLs) rapidly and easily, without the expense of recording data specifically for this purpose, we focus on English initially to identify the best filtering techniques and evaluation methods. We find that, when a large amount of data is available, selecting from the corpus based on criteria such as standard deviation of f0, fast speaking rate, and hypo-articulation produces the most intelligible voices.

Subject: INTERSPEECH.2017 - Speech Synthesis

cooper17@interspeech_2017@ISCA
