RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching

📅 2025-06-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing ODE-based TTS models face a trade-off between the number of generation steps and speech quality. RapFlow-TTS is a rapid, high-fidelity TTS acoustic model that addresses this trade-off by adding a velocity consistency constraint to flow matching (FM) training: the velocity field is kept consistent along the FM-straightened ODE trajectory, so synthesis quality holds up when the trajectory is integrated in only a few steps. Time interval scheduling and adversarial learning further improve few-step quality. In the reported experiments, RapFlow-TTS achieves high-fidelity synthesis with 5 and 10 times fewer synthesis steps than conventional FM-based and score-based approaches, respectively, which eases the inference bottleneck for fast, high-quality TTS.

📝 Abstract

We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, RapFlow-TTS enforces consistency in the velocity field along the FM-straightened ODE trajectory, enabling consistent synthetic quality with fewer generation steps. Additionally, we introduce techniques such as time interval scheduling and adversarial learning to further enhance the quality of few-step synthesis. Experimental results show that RapFlow-TTS achieves high-fidelity speech synthesis with 5- and 10-fold fewer synthesis steps than conventional FM- and score-based approaches, respectively.
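
To make the link between velocity consistency and step count concrete, here is a minimal toy sketch in pure Python. It is not the RapFlow-TTS model: the 2-D points `X0` and `TARGET`, the two hand-written velocity fields, and the fixed-step Euler solver are all assumptions chosen for illustration. It shows that when the velocity is constant along a straight trajectory, a single Euler step already lands on the endpoint, whereas a velocity that changes along the same path needs many steps to get close.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 2-D start (noise) and end (data) points, for illustration only.
X0 = [0.0, 0.0]
TARGET = [1.0, -2.0]


def constant_velocity(x, t):
    """Straight path with the same velocity at every t (the consistent case)."""
    return [b - a for a, b in zip(X0, TARGET)]


def time_varying_velocity(x, t):
    """Same straight path and same endpoint, but the speed grows with t."""
    return [2.0 * t * (b - a) for a, b in zip(X0, TARGET)]


def max_error(x):
    """Largest per-coordinate distance from the true endpoint."""
    return max(abs(a - b) for a, b in zip(x, TARGET))


one_step_constant = max_error(euler_sample(constant_velocity, X0, 1))
one_step_varying = max_error(euler_sample(time_varying_velocity, X0, 1))
ten_step_varying = max_error(euler_sample(time_varying_velocity, X0, 10))

print(one_step_constant, one_step_varying, ten_step_varying)
```

In this toy setting the constant field is solved exactly in one step, while the time-varying field still misses the endpoint after ten steps. That gap is the intuition behind enforcing a consistent velocity along the trajectory before reducing the number of synthesis steps.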

Problem

Research questions and friction points this paper is trying to address.

Improves TTS speed and quality with flow matching

Reduces synthesis steps while maintaining fidelity

Enhances few-step synthesis using adversarial learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Velocity consistency constraints in flow matching

Time interval scheduling for synthesis enhancement

Adversarial learning to improve few-step quality
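
As a rough illustration of what a velocity consistency constraint can look like, the sketch below follows one common formulation from the consistency flow matching literature: compare the velocity, and the endpoint extrapolated from it, at two nearby times on the same interpolation path. This page does not give the paper's exact loss, so the function names, the weight `alpha`, the interval `dt`, and the toy velocity fields are all assumptions. In real training the second evaluation would typically come from a target network without gradients; that detail is omitted here.

```python
def endpoint_estimate(velocity, x, t):
    """Extrapolate from (x, t) to t=1 along the current velocity: x + (1 - t) * v."""
    v = velocity(x, t)
    return [xi + (1.0 - t) * vi for xi, vi in zip(x, v)]


def interpolate(x0, x1, t):
    """Point at time t on the straight path between x0 and x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def consistency_loss(velocity, x0, x1, t, dt, alpha=1.0):
    """Squared mismatch of endpoint estimates and velocities at times t and t + dt.

    Assumed form, for illustration only; alpha weights the velocity term.
    """
    x_t = interpolate(x0, x1, t)
    x_s = interpolate(x0, x1, t + dt)
    f_t = endpoint_estimate(velocity, x_t, t)
    f_s = endpoint_estimate(velocity, x_s, t + dt)
    v_t = velocity(x_t, t)
    v_s = velocity(x_s, t + dt)
    endpoint_term = sum((a - b) ** 2 for a, b in zip(f_t, f_s))
    velocity_term = sum((a - b) ** 2 for a, b in zip(v_t, v_s))
    return endpoint_term + alpha * velocity_term


# Hypothetical 2-D noise and data points, for illustration only.
NOISE = [0.0, 0.0]
DATA = [1.0, -2.0]


def consistent_field(x, t):
    """Constant velocity along the path: already satisfies the constraint."""
    return [b - a for a, b in zip(NOISE, DATA)]


def inconsistent_field(x, t):
    """Velocity that changes with t: violates the constraint."""
    return [2.0 * t * (b - a) for a, b in zip(NOISE, DATA)]


loss_consistent = consistency_loss(consistent_field, NOISE, DATA, t=0.25, dt=0.25)
loss_inconsistent = consistency_loss(inconsistent_field, NOISE, DATA, t=0.25, dt=0.25)

print(loss_consistent, loss_inconsistent)
```

The loss vanishes for the constant field and is positive for the time-varying one, so minimizing it pushes a learned field toward the straight, constant-velocity behavior that makes few-step sampling accurate. The interval `dt` is the kind of quantity a time interval schedule would control, though the schedule itself is not described on this page.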
