Prosody prediction is crucial for pitch-accent languages such as Japanese in text-to-speech (TTS) synthesis. Traditional methods rely on accent labels, which are often incomplete and generalize poorly. BERT-based models, such as fo-BERT, enable fundamental frequency (fo) prediction without accent labels but have been limited to single-speaker TTS. We propose GST-BERT-TTS, a novel method for multi-speaker TTS that integrates speaker-specific style embeddings derived from global style tokens (GST) into the token embeddings of BERT. The proposed method realizes speaker-aware fo prediction in an accent-label-free setting. Additionally, we extend fo-BERT to predict not only log fo but also energy and duration, improving the expressiveness of synthesized speech. Experiments on a Japanese multi-speaker TTS corpus demonstrate that GST-BERT-TTS improves prosody prediction accuracy and synthesis quality compared with fo-BERT.
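To make the described architecture concrete, the following is a minimal sketch (not the authors' code) of the two ideas in the abstract: injecting a GST-derived speaker style embedding into BERT-style token embeddings, and attaching a multi-task head that predicts log fo, energy, and duration per token. The class name, layer sizes, additive style injection, and the attention-based lookup over the style-token bank are all assumptions for illustration.

```python
import torch
import torch.nn as nn


class GSTBERTProsodyPredictor(nn.Module):
    """Illustrative sketch: BERT-style encoder with GST speaker style
    added to the token embeddings, plus per-token heads for
    log fo, energy, and duration. All hyperparameters are assumed."""

    def __init__(self, vocab_size=32000, d_model=768, n_style_tokens=10):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Bank of learnable global style tokens (size is a guess).
        self.style_tokens = nn.Parameter(torch.randn(n_style_tokens, d_model))
        # Attention that maps a speaker reference embedding onto the bank.
        self.style_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead=12,
                                           dim_feedforward=3072,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        # One linear head per prosodic target.
        self.heads = nn.ModuleDict({
            name: nn.Linear(d_model, 1)
            for name in ("log_f0", "energy", "duration")
        })

    def forward(self, token_ids, speaker_emb):
        # token_ids: (B, T) int tokens; speaker_emb: (B, D) speaker reference.
        x = self.token_emb(token_ids)                                # (B, T, D)
        bank = self.style_tokens.unsqueeze(0).expand(x.size(0), -1, -1)
        # Attend from the speaker embedding over the style-token bank.
        style, _ = self.style_attn(speaker_emb.unsqueeze(1), bank, bank)
        # Add the resulting style vector to every token embedding (assumption).
        h = self.encoder(x + style)                                  # (B, T, D)
        return {name: head(h).squeeze(-1) for name, head in self.heads.items()}


# Smoke test with dummy inputs.
model = GSTBERTProsodyPredictor()
tokens = torch.randint(0, 32000, (2, 16))
speaker = torch.randn(2, 768)
out = model(tokens, speaker)
print({k: v.shape for k, v in out.items()})  # each target: (2, 16)
```

In this sketch the style vector is broadcast additively across all token positions, which is one simple way to condition every token on the speaker; concatenation followed by a projection would be an equally plausible alternative.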