ogura25@interspeech_2025@ISCA

Total: 1

#1 GST-BERT-TTS: Prosody Prediction Without Accentual Labels For Multi-Speaker TTS Using BERT With Global Style Tokens

Authors: Tadashi Ogura, Takuma Okamoto, Yamato Ohtani, Erica Cooper, Tomoki Toda, Hisashi Kawai

Prosody prediction is crucial for pitch-accent languages such as Japanese in text-to-speech (TTS) synthesis. Traditional methods rely on accent labels, which are often incomplete and generalize poorly. BERT-based models such as fo-BERT enable fundamental frequency (fo) prediction without accent labels but have been limited to single-speaker TTS. We propose GST-BERT-TTS, a novel method for multi-speaker TTS that integrates speaker-specific style embeddings derived from global style tokens (GST) into the BERT token embeddings. The proposed method realizes speaker-aware fo prediction in an accent label-free setting. Additionally, we extend fo-BERT to predict not only log fo but also energy and duration, improving speech expressiveness. Experiments on a Japanese multi-speaker TTS corpus demonstrate that GST-BERT-TTS improves prosody prediction accuracy and synthesis quality compared with fo-BERT.
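The core idea of the abstract, injecting a GST-derived speaker style vector into BERT's token embeddings, can be sketched in a few lines. This is a minimal, dependency-free illustration, not the authors' implementation: the function names are hypothetical, the attention weights over style tokens are given as inputs rather than learned, and the injection is assumed to be elementwise addition (the paper does not specify the combination operator here).

```python
def gst_style_embedding(style_tokens, attn_weights):
    """Weighted sum of learned GST style tokens -> one speaker style vector.

    style_tokens: list of style-token vectors (all the same dimension)
    attn_weights: one attention weight per style token (assumed precomputed)
    """
    dim = len(style_tokens[0])
    return [sum(w * tok[i] for w, tok in zip(attn_weights, style_tokens))
            for i in range(dim)]

def inject_style(token_embeddings, style_vector):
    """Add the speaker style vector to every BERT token embedding.

    Addition is an assumption; concatenation or a learned projection
    would be equally plausible injection schemes.
    """
    return [[t + s for t, s in zip(tok, style_vector)]
            for tok in token_embeddings]

# Toy example: two 2-d token embeddings, two style tokens, uniform attention.
tokens = [[1.0, 2.0], [3.0, 4.0]]
style_tokens = [[1.0, 0.0], [0.0, 1.0]]
style = gst_style_embedding(style_tokens, [0.5, 0.5])   # -> [0.5, 0.5]
styled = inject_style(tokens, style)                    # -> [[1.5, 2.5], [3.5, 4.5]]
```

The styled embeddings would then feed the BERT encoder, whose outputs drive the fo, energy, and duration predictors described in the abstract.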

Subject: INTERSPEECH.2025 - Speech Synthesis