🤖 AI Summary
Flow-matching text-to-speech (TTS) models suffer from slow inference because they rely on multi-step iterative sampling. To address this, we propose Empirically Pruned Step Sampling (EPSS), a training-free, model-agnostic non-uniform sampling strategy that identifies and skips redundant timesteps by analyzing forward sampling trajectories, with no retraining required. EPSS significantly accelerates inference while preserving speech quality: applied to F5-TTS, it generates high-fidelity speech in only 7 steps, achieving a real-time factor (RTF) of 0.030 on an RTX 3090, 4× faster than standard sampling with no perceptible quality degradation, and it generalizes well to other architectures such as E2 TTS. To our knowledge, EPSS is the first trajectory-analysis-based, universally applicable, and efficient sampling paradigm for flow-matching TTS, offering a practical remedy for inference latency without architectural or training modifications.
📝 Abstract
Flow-matching-based text-to-speech (TTS) models, such as Voicebox, E2 TTS, and F5-TTS, have attracted significant attention in recent years. These models require multiple sampling steps to reconstruct speech from noise, making inference speed a key challenge. Reducing the number of sampling steps can greatly improve inference efficiency. To this end, we introduce Fast F5-TTS, a training-free approach to accelerate the inference of flow-matching-based TTS models. By inspecting the sampling trajectory of F5-TTS, we identify redundant steps and propose Empirically Pruned Step Sampling (EPSS), a non-uniform time-step sampling strategy that effectively reduces the number of sampling steps. Our approach achieves a 7-step generation with an inference RTF of 0.030 on an NVIDIA RTX 3090 GPU, making it 4 times faster than the original F5-TTS while maintaining comparable performance. Furthermore, EPSS performs well on E2 TTS models, demonstrating its strong generalization ability.
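The core idea, replacing a uniform timestep grid with a shorter, empirically chosen non-uniform schedule in the ODE sampler, can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the `pruned` schedule values are hypothetical stand-ins for a trajectory-derived schedule, and a toy constant velocity field stands in for the learned flow-matching model.

```python
import numpy as np

def euler_sample(velocity_fn, x_init, timesteps):
    """Explicit Euler integration of the flow-matching ODE dx/dt = v(x, t)
    over an arbitrary (possibly non-uniform) timestep schedule."""
    x = np.array(x_init, dtype=float)
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        x = x + (t_next - t_cur) * velocity_fn(x, t_cur)
    return x

# Toy straight-line flow from noise toward a fixed target: the velocity
# field is constant, so any schedule spanning [0, 1] integrates exactly
# to the target. A real TTS model's curved trajectory is where a pruned
# schedule trades steps for accuracy.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
target = np.ones(4)
v = lambda x, t: target - x0

# Uniform baseline: 32 steps, evenly spaced on [0, 1].
uniform = np.linspace(0.0, 1.0, 33)

# Hypothetical pruned 7-step schedule in the spirit of EPSS: dense early
# steps where the trajectory changes quickly, coarse strides later where
# steps are empirically redundant. Values are illustrative only.
pruned = np.array([0.0, 0.02, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0])

out_uniform = euler_sample(v, x0, uniform)
out_pruned = euler_sample(v, x0, pruned)
```

Both schedules reach the same endpoint here because the toy field is constant; with a learned model, the pruned schedule instead approximates the full trajectory with far fewer network evaluations, which is the source of the reported speedup.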