F5R-TTS: Improving Flow Matching based Text-to-Speech with Group Relative Policy Optimization

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient speech intelligibility and speaker similarity in flow-matching TTS systems, this paper proposes F5R-TTS. Methodologically, it introduces Group Relative Policy Optimization (GRPO) into the flow-matching framework for the first time. Reinforcement learning becomes applicable because the model's deterministic outputs are reformulated as Gaussian distributions that can be sampled. A dual-objective GRPO mechanism then jointly optimizes ASR-based intelligibility (measured by WER) and speaker identity preservation (measured by SIM). The core contribution is a probabilistic reconstruction paradigm that places flow matching and policy optimization under one probabilistic modeling framework. In zero-shot voice cloning experiments, F5R-TTS achieves a 29.5% relative reduction in WER and a 4.6% relative improvement in SIM over the baseline flow-matching TTS, improving both intelligibility and speaker similarity.

📝 Abstract
We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching based architecture. By reformulating the deterministic outputs of flow-matching TTS as probabilistic Gaussian distributions, our approach enables seamless integration of reinforcement learning algorithms. During pretraining, we train a probabilistically reformulated flow-matching model, derived from F5-TTS, on an open-source dataset. In the subsequent reinforcement learning (RL) phase, a GRPO-driven enhancement stage leverages two reward metrics: word error rate (WER), computed via automatic speech recognition, and speaker similarity (SIM), assessed by verification models. Experimental results on zero-shot voice cloning show that F5R-TTS improves both speech intelligibility (a 29.5% relative WER reduction) and speaker similarity (a 4.6% relative SIM score increase) compared to conventional flow-matching based TTS systems. Audio samples are available at https://frontierlabs.github.io/F5R.
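The Gaussian reformulation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the model emits a per-dimension mean and log standard deviation instead of a single deterministic vector; the function names, the diagonal-Gaussian form, and the toy values are illustrative assumptions, not the paper's implementation. The point is that a sampled output now has a log-probability, which is the quantity a policy-gradient method such as GRPO needs.

```python
import math
import random


def sample_gaussian(mean, log_std, rng):
    """Draw one sample from a diagonal Gaussian N(mean, exp(log_std)^2)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]


def gaussian_log_prob(x, mean, log_std):
    """Log-density of x under the same diagonal Gaussian, summed over dimensions."""
    total = 0.0
    for xi, m, s in zip(x, mean, log_std):
        var = math.exp(2.0 * s)
        total += -0.5 * (xi - m) ** 2 / var - s - 0.5 * math.log(2.0 * math.pi)
    return total


# Hypothetical 3-dimensional model output (made-up numbers for illustration).
rng = random.Random(0)
mean = [0.0, 0.5, -0.5]
log_std = [-1.0, -1.0, -1.0]

sample = sample_gaussian(mean, log_std, rng)          # stochastic "action"
log_prob = gaussian_log_prob(sample, mean, log_std)   # usable in a policy-gradient loss
```

A deterministic model corresponds to always returning `mean`; sampling around it is what lets several different outputs be generated and compared for the same input.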
Problem

Research questions and friction points this paper is trying to address.

Improving TTS with flow-matching and reinforcement learning
Enhancing speech intelligibility and speaker similarity
Integrating probabilistic outputs with dual reward metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates GRPO into flow-matching TTS
Reformulates outputs as Gaussian distributions
Uses dual reward metrics WER and SIM
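As a rough illustration of how two reward metrics can feed a group-relative update, the sketch below scores a group of candidate outputs for one prompt and normalizes the rewards within the group, which is the standard GRPO advantage. The linear reward combination, its weights, and the WER/SIM numbers are hypothetical placeholders; the summary above does not specify how the paper combines the two metrics.

```python
import statistics


def combined_reward(wer, sim, wer_weight=1.0, sim_weight=1.0):
    """Assumed reward: lower WER is better, so it enters with a negative sign."""
    return -wer_weight * wer + sim_weight * sim


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up (WER, SIM) scores for four samples generated from the same prompt.
group = [(0.10, 0.62), (0.02, 0.70), (0.25, 0.55), (0.05, 0.66)]

rewards = [combined_reward(wer, sim) for wer, sim in group]
advantages = group_relative_advantages(rewards)

# Samples with above-average reward get positive advantages and are reinforced.
best = max(range(len(group)), key=lambda i: advantages[i])
```

Because advantages are computed relative to the group mean, no separate value network is needed; only the ranking and spread of rewards within the group matter.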
Xiaohui Sun
Platform and Content Group, Tencent
Jianye Mo
Platform and Content Group, Tencent
Bowen Wu
Platform and Content Group, Tencent
Qun Yu
Platform and Content Group, Tencent
Baoxun Wang
PCG, Tencent
Natural Language Processing, Deep Learning, Chat-Bot