F5R-TTS: Improving Flow Matching based Text-to-Speech with Group Relative Policy Optimization

Authors: Xiaohui Sun, Ruitong Xiao, Jianye Mo, Bowen Wu, Qun Yu, Baoxun Wang

We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching based architecture. By reformulating the deterministic outputs of flow-matching TTS into probabilistic Gaussian distributions, our approach enables seamless integration of reinforcement learning algorithms. During pretraining, we train a probabilistically reformulated flow-matching based model, derived from F5-TTS, on an open-source dataset. In the subsequent reinforcement learning (RL) phase, we employ a GRPO-driven enhancement stage that leverages dual reward metrics: word error rate (WER) computed via automatic speech recognition and speaker similarity (SIM) assessed by verification models. Experimental results on zero-shot voice cloning demonstrate that F5R-TTS achieves significant improvements in both speech intelligibility (a 29.5% relative WER reduction) and speaker similarity (a 4.6% relative SIM score increase) compared to conventional flow-matching based TTS systems. Audio samples are available at https://frontierlabs.github.io/F5R.
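
The dual-reward GRPO stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the linear reward combination, the weights, and the sample values are all assumptions made for illustration. It shows only the group-relative step that gives GRPO its name, in which each sample's reward is normalized against the other samples generated for the same prompt.

```python
from statistics import mean, pstdev


def combined_reward(wer, sim, wer_weight=1.0, sim_weight=1.0):
    """Hypothetical scalar reward: lower WER and higher SIM are both better.

    The linear form and the unit weights are assumptions, not from the paper.
    """
    return sim_weight * sim - wer_weight * wer


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up (WER, SIM) pairs for four samples generated from one prompt.
group = [(0.02, 0.66), (0.05, 0.61), (0.01, 0.70), (0.08, 0.58)]
rewards = [combined_reward(wer, sim) for wer, sim in group]
advantages = group_relative_advantages(rewards)

for (wer, sim), adv in zip(group, advantages):
    print(f"WER={wer:.2f} SIM={sim:.2f} advantage={adv:+.3f}")
```

Samples that beat their group's average reward receive a positive advantage and the rest a negative one, so no separately learned value function is needed to provide a baseline.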

Subjects: Sound, Audio and Speech Processing

Published: 2025-04-03 08:57:15 UTC

2504.02407
