🤖 AI Summary
Existing ODE-based TTS models face a trade-off between the number of generation steps and speech quality. To address this, we propose RapFlow-TTS, a fast, high-fidelity TTS acoustic model. Its core idea is a velocity consistency constraint applied during flow matching (FM) training: the velocity field is required to stay consistent along the FM-straightened ODE trajectory, so that synthesis quality holds up when fewer steps are used. We further add time interval scheduling and adversarial learning to improve the quality of few-step synthesis. Experiments show that RapFlow-TTS achieves high-fidelity speech synthesis with 5× and 10× fewer synthesis steps than conventional FM-based and score-based approaches, respectively, easing the inference bottleneck for high-quality TTS.
📝 Abstract
We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, RapFlow-TTS enforces consistency in the velocity field along the FM-straightened ODE trajectory, enabling consistent synthesis quality with fewer generation steps. Additionally, we introduce techniques such as time interval scheduling and adversarial learning to further enhance the quality of few-step synthesis. Experimental results show that RapFlow-TTS achieves high-fidelity speech synthesis with 5- and 10-fold fewer synthesis steps than conventional FM- and score-based approaches, respectively.
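As a rough illustration of why a consistent velocity along a straight trajectory permits few-step generation, the toy sketch below integrates a scalar ODE with Euler steps. It is not the RapFlow-TTS model or its training loss: the names `euler_sample`, `straight_velocity`, `curved_velocity`, and `consistency_gap`, and the endpoints `X0` and `X1`, are all assumptions made up for this example.

```python
import math

X0, X1 = 0.0, 2.0  # hypothetical scalar "noise" and "data" endpoints


def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with n_steps Euler steps."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt)
    return x


def straight_velocity(x, t):
    # Constant velocity: the trajectory is the straight line from X0 to X1.
    return X1 - X0


def curved_velocity(x, t):
    # Time-varying velocity whose exact integral over [0, 1] is also X1 - X0,
    # so the true endpoint is the same but the velocity changes along the way.
    return (X1 - X0) * (math.pi / 2) * math.sin(math.pi * t)


def consistency_gap(velocity, t, dt):
    """Squared difference between velocities at t and t + dt on the straight path.

    A toy stand-in for a velocity consistency penalty, not the paper's loss.
    """
    x_t = (1 - t) * X0 + t * X1
    x_s = (1 - (t + dt)) * X0 + (t + dt) * X1
    return (velocity(x_t, t) - velocity(x_s, t + dt)) ** 2


if __name__ == "__main__":
    for n in (2, 10, 100):
        err_straight = abs(euler_sample(straight_velocity, X0, n) - X1)
        err_curved = abs(euler_sample(curved_velocity, X0, n) - X1)
        print(f"steps={n:3d}  straight error={err_straight:.4f}  curved error={err_curved:.4f}")
```

In this toy setting the constant-velocity field reaches the target even with two steps, while the time-varying field needs many more steps to get close. A penalty such as `consistency_gap` is zero for the first field and positive for the second, which is the intuition behind enforcing velocity consistency during training.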