RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching

📅 2025-06-20

🤖 AI Summary
Existing ODE-based TTS models trade speech quality against the number of generation steps. RapFlow-TTS is a rapid, high-fidelity TTS acoustic model that addresses this trade-off by adding a velocity consistency constraint to flow matching (FM) training: the velocity field is required to stay consistent along the FM-straightened ODE trajectory, so synthesis quality holds up when fewer steps are used. Time interval scheduling and adversarial learning further improve few-step quality. In the reported experiments, RapFlow-TTS achieves high-fidelity synthesis with 5 times fewer steps than conventional FM-based models and 10 times fewer than score-based models.

📝 Abstract
We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, RapFlow-TTS enforces consistency in the velocity field along the FM-straightened ODE trajectory, enabling consistent synthesis quality with fewer generation steps. Additionally, we introduce time interval scheduling and adversarial learning to further enhance the quality of few-step synthesis. Experimental results show that RapFlow-TTS achieves high-fidelity speech synthesis with 5- and 10-fold fewer synthesis steps than the conventional FM- and score-based approaches, respectively.
Problem

Research questions and friction points this paper is trying to address.

Improves TTS speed and quality with flow matching
Reduces synthesis steps while maintaining fidelity
Enhances few-step synthesis using adversarial learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Velocity consistency constraints in flow matching
Time interval scheduling for synthesis enhancement
Adversarial learning to improve few-step quality
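
To make the link between straight trajectories and few-step synthesis concrete, here is a minimal toy sketch in plain Python. It is not the RapFlow-TTS implementation: the function names (`euler_sample`, `endpoint_estimate`, `velocity_consistency_loss`), the weight `alpha`, and both toy velocity fields are assumptions made for illustration, and the loss follows a generic consistency-style form (matching the one-step endpoint estimate and the velocity at two points of one trajectory) rather than the paper's exact objective.

```python
import math
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[float, Vector], Vector]


def euler_sample(velocity: VelocityFn, x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(t, x) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(t, x)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def endpoint_estimate(velocity: VelocityFn, t: float, x: Vector) -> Vector:
    """Extrapolate to t=1 in a single step: f(t, x) = x + (1 - t) * v(t, x)."""
    v = velocity(t, x)
    return [xi + (1.0 - t) * vi for xi, vi in zip(x, v)]


def velocity_consistency_loss(
    velocity: VelocityFn,
    target_velocity: VelocityFn,  # hypothetical frozen/averaged copy of the model
    t: float,
    x_t: Vector,
    t_next: float,
    x_next: Vector,
    alpha: float = 1.0,  # assumed weight on the velocity-matching term
) -> float:
    """Squared mismatch between two points assumed to lie on one trajectory."""
    f_a = endpoint_estimate(velocity, t, x_t)
    f_b = endpoint_estimate(target_velocity, t_next, x_next)
    v_a = velocity(t, x_t)
    v_b = target_velocity(t_next, x_next)
    endpoint_term = sum((a - b) ** 2 for a, b in zip(f_a, f_b))
    velocity_term = sum((a - b) ** 2 for a, b in zip(v_a, v_b))
    return endpoint_term + alpha * velocity_term


def straight_field(t: float, x: Vector) -> Vector:
    """Toy field with constant velocity: every trajectory is a straight line."""
    return [1.0, -2.0]


def curved_field(t: float, x: Vector) -> Vector:
    """Toy field dx/dt = x: trajectories are exponentials, so velocity changes."""
    return list(x)


# Straight trajectory: the step count barely matters.
straight_1 = euler_sample(straight_field, [0.0, 0.0], 1)
straight_10 = euler_sample(straight_field, [0.0, 0.0], 10)

# Curved trajectory: one Euler step lands far from the ten-step result.
curved_1 = euler_sample(curved_field, [1.0], 1)
curved_10 = euler_sample(curved_field, [1.0], 10)

# Loss evaluated at t=0.4 and t=0.5 on the exact trajectory of each field.
straight_loss = velocity_consistency_loss(
    straight_field, straight_field, 0.4, [0.4, -0.8], 0.5, [0.5, -1.0]
)
curved_loss = velocity_consistency_loss(
    curved_field, curved_field, 0.4, [math.exp(0.4)], 0.5, [math.exp(0.5)]
)

print("straight:", straight_1, straight_10, "loss:", straight_loss)
print("curved:  ", curved_1, curved_10, "loss:", curved_loss)
```

In this toy setting the constant-velocity field gives essentially the same endpoint with one step as with ten and a consistency loss of about zero, while the curved field gives step-dependent endpoints and a positive loss. That is the intuition behind penalizing velocity inconsistency so that a model can be sampled in few steps.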
Hyun Joon Park
NAVER Cloud, Republic of Korea; School of Industrial and Management Engineering, Korea University, Republic of Korea
Jeongmin Liu
NAVER Cloud, Republic of Korea
Jin Sob Kim
School of Industrial and Management Engineering, Korea University, Republic of Korea
Jeong Yeol Yang
School of Industrial and Management Engineering, Korea University, Republic of Korea
Sung Won Han
School of Industrial and Management Engineering, Korea University, Republic of Korea
Eunwoo Song
Voice, Naver Cloud
Speech Synthesis