F5R-TTS: Improving Flow Matching based Text-to-Speech with Group Relative Policy Optimization

📅 2025-04-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address insufficient speech intelligibility and speaker similarity in flow-matching TTS systems, this paper proposes F5R-TTS. Methodologically, it introduces Group Relative Policy Optimization (GRPO) into the flow-matching framework. Reinforcement learning becomes applicable because the model's deterministic output is reformulated as a Gaussian distribution, which gives the policy a well-defined probability for each sampled output. A dual-reward GRPO stage then jointly optimizes intelligibility, measured by word error rate (WER) from an automatic speech recognition model, and speaker identity preservation, measured by speaker similarity (SIM) from verification models. The core contribution is this probabilistic reformulation, which places flow matching and policy optimization under a single probabilistic modeling framework. In zero-shot voice cloning experiments, F5R-TTS achieves a 29.5% relative reduction in WER and a 4.6% relative improvement in SIM over conventional flow-matching TTS, indicating gains in both intelligibility and speaker similarity.

📝 Abstract

We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching based architecture. By reformulating the deterministic outputs of flow-matching TTS into probabilistic Gaussian distributions, our approach enables seamless integration of reinforcement learning algorithms. During pretraining, we train a probabilistically reformulated flow-matching based model, derived from F5-TTS, on an open-source dataset. In the subsequent reinforcement learning (RL) phase, we employ a GRPO-driven enhancement stage that leverages dual reward metrics: word error rate (WER) computed via automatic speech recognition and speaker similarity (SIM) assessed by verification models. Experimental results on zero-shot voice cloning demonstrate that F5R-TTS achieves significant improvements in both speech intelligibility (a 29.5% relative WER reduction) and speaker similarity (a 4.6% relative SIM score increase) compared to conventional flow-matching based TTS systems. Audio samples are available at https://frontierlabs.github.io/F5R.
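
The group-relative step that gives GRPO its name can be illustrated with a minimal sketch. It assumes the two rewards are combined as a weighted sum, with WER entering negatively because lower is better. The weights, the function names, and the sample values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def combined_reward(wer, sim, wer_weight=1.0, sim_weight=1.0):
    """Scalar reward from the two metrics (assumed weighted sum, not the paper's formula).

    Lower WER is better, so it enters with a negative sign; higher SIM is better.
    """
    return -wer_weight * wer + sim_weight * sim


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is the reward's deviation from the group mean, divided by
    the group standard deviation, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up (WER, SIM) pairs for four samples generated from one prompt.
samples = [(0.10, 0.60), (0.02, 0.70), (0.30, 0.55), (0.05, 0.65)]
rewards = [combined_reward(wer, sim) for wer, sim in samples]
advantages = group_relative_advantages(rewards)
```

Samples with a lower WER and a higher SIM than the rest of their group receive positive advantages, so the policy update would raise their probability relative to the others.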

Problem

Research questions and friction points this paper is trying to address.

Improving TTS with flow-matching and reinforcement learning

Enhancing speech intelligibility and speaker similarity

Integrating probabilistic outputs with dual reward metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates GRPO into flow-matching TTS

Reformulates outputs as Gaussian distributions

Uses dual reward metrics: WER and SIM
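
Why a Gaussian output matters for policy optimization can be shown with a short sketch: once the model emits a distribution instead of a single point, an output can be sampled and its log-probability computed, which is the quantity a policy-gradient method needs. The diagonal Gaussian with a predicted mean and log-variance is an assumption made here for illustration; this summary does not state the paper's exact parameterization, and the function names are hypothetical.

```python
import math
import random


def sample_gaussian(mean, log_var, rng):
    """Draw one sample from a diagonal Gaussian given per-dimension mean and log-variance."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]


def gaussian_log_prob(x, mean, log_var):
    """Log-density of x under a diagonal Gaussian, summed over dimensions."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi) + lv + (xi - m) ** 2 / math.exp(lv))
        for xi, m, lv in zip(x, mean, log_var)
    )


# Hypothetical model outputs for a three-dimensional prediction.
rng = random.Random(0)
mean = [0.0, 1.0, -1.0]
log_var = [0.0, -1.0, 0.5]

x = sample_gaussian(mean, log_var, rng)
log_p = gaussian_log_prob(x, mean, log_var)
```

A deterministic output has no such log-probability, which is why the reformulation is a precondition for applying GRPO.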

🔎 Similar Papers

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching