badlani23@interspeech_2023@ISCA

#1 RAD-MMM: Multilingual Multiaccented Multispeaker Text To Speech

Authors: Rohan Badlani ; Rafael Valle ; Kevin J. Shih ; João Felipe Santos ; Siddharth Gururani ; Bryan Catanzaro

We create a multilingual speech synthesis system that can generate speech with a native accent in any seen language while retaining the characteristics of an individual's voice. It is expensive to obtain bilingual training data for a speaker, and the lack of such data produces strong correlations that entangle speaker, language, and accent, leading to poor transfer capabilities. To overcome this, we present RADMMM, a speech synthesis model based on RADTTS with explicit control over accent, language, speaker, and fine-grained F0 and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any speaker in an open-source dataset comprising 7 languages, with one native speaker per language. Human subjective evaluation demonstrates that, when compared to controlled baselines, our model better retains a speaker's voice and target accent, while synthesizing fluent speech in all target languages and accents in our dataset.