Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

Authors: Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Fadi Biadsy, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov

Collecting high-quality studio recordings is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The framework combines speech-text encoder pretraining with unsupervised training on untranscribed speech and unspoken text, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, the TTS model can generate intelligible speech in more than 30 unseen languages (character error rate (CER) difference of less than 10% from ground truth). With just 15 minutes of transcribed found data, the intelligibility difference from ground truth falls to 1% or less, and naturalness scores match ground truth in several languages.
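The intelligibility figures above are stated as a CER difference between synthesized and ground-truth speech. As a minimal sketch of how such a gap could be computed, the snippet below measures character error rate with a plain Levenshtein distance over ASR transcripts. The function names (`edit_distance`, `cer`, `cer_gap`) are hypothetical, and treating the difference as an absolute subtraction of two CER values is an assumption; the abstract does not specify the ASR system, the text normalization, or whether the gap is absolute or relative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def cer_gap(ref: str, hyp_synth: str, hyp_natural: str) -> float:
    """Assumed definition: CER of the ASR transcript of synthesized speech
    minus CER of the ASR transcript of the natural recording."""
    return cer(ref, hyp_synth) - cer(ref, hyp_natural)
```

Under this reading, a language would count as intelligible when `cer_gap` stays below 0.10 for the zero-resource setting, and the 15-minute setting would correspond to a gap of 0.01 or less.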

Subjects: Audio and Speech Processing, Sound

Published: 2024-02-29 07:49:10 UTC

arXiv: 2402.18932
