RAD-MMM: Multilingual Multiaccented Multispeaker Text To Speech

Authors: Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro

We create a multilingual speech synthesis system that can generate speech with a native accent in any seen language while retaining the characteristics of an individual's voice. Bilingual training data for a single speaker is expensive to obtain, and its absence produces strong correlations that entangle speaker, language, and accent, which leads to poor transfer capabilities. To overcome this, we present RADMMM, a speech synthesis model based on RADTTS with explicit control over accent, language, speaker, and fine-grained F0 and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any speaker in an open-source dataset comprising 7 languages, with one native speaker per language. Human subjective evaluation demonstrates that, compared to controlled baselines, our model better retains a speaker's voice and target accent while synthesizing fluent speech in all target languages and accents in our dataset.

Subject: INTERSPEECH.2023 - Speech Synthesis

badlani23@interspeech_2023@ISCA
