Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling

📅 2025-05-26
🤖 AI Summary
Flow-matching text-to-speech (TTS) models are slow at inference because they rely on multi-step iterative sampling. This paper proposes Empirically Pruned Step Sampling (EPSS), a training-free, non-uniform time-step sampling strategy that identifies redundant steps by inspecting the sampling trajectory of F5-TTS and skips them, with no model retraining. With EPSS, F5-TTS generates speech in 7 sampling steps at a real-time factor (RTF) of 0.030 on an NVIDIA RTX 3090 GPU, 4 times faster than the original F5-TTS while maintaining comparable performance. EPSS also performs well on E2 TTS models, which indicates that it generalizes beyond the model it was derived from.

📝 Abstract
Flow-matching-based text-to-speech (TTS) models, such as Voicebox, E2 TTS, and F5-TTS, have attracted significant attention in recent years. These models require multiple sampling steps to reconstruct speech from noise, making inference speed a key challenge. Reducing the number of sampling steps can greatly improve inference efficiency. To this end, we introduce Fast F5-TTS, a training-free approach to accelerate the inference of flow-matching-based TTS models. By inspecting the sampling trajectory of F5-TTS, we identify redundant steps and propose Empirically Pruned Step Sampling (EPSS), a non-uniform time-step sampling strategy that effectively reduces the number of sampling steps. Our approach achieves a 7-step generation with an inference RTF of 0.030 on an NVIDIA RTX 3090 GPU, making it 4 times faster than the original F5-TTS while maintaining comparable performance. Furthermore, EPSS performs well on E2 TTS models, demonstrating its strong generalization ability.
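The non-uniform time-step idea described in the abstract can be illustrated with a minimal sketch. It assumes a plain Euler solver over an arbitrary time grid; the names `euler_sample`, `uniform_schedule`, `toy_velocity`, and the grid `ILLUSTRATIVE_7_STEP_GRID` are hypothetical and are not taken from the paper or from the F5-TTS code. The abstract does not list the actual EPSS time steps, so the grid values here are placeholders chosen only to show the mechanics.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    timesteps: Sequence[float],
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps.

    The grid may be non-uniform; each interval costs one velocity call.
    """
    if len(timesteps) < 2 or timesteps[0] != 0.0 or timesteps[-1] != 1.0:
        raise ValueError("timesteps must start at 0.0 and end at 1.0")
    if any(b <= a for a, b in zip(timesteps, timesteps[1:])):
        raise ValueError("timesteps must be strictly increasing")

    x = list(x0)
    for t_cur, t_next in zip(timesteps, timesteps[1:]):
        dt = t_next - t_cur           # step size varies along the grid
        v = velocity(x, t_cur)        # one model evaluation per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def uniform_schedule(num_steps: int) -> List[float]:
    """Evenly spaced grid with num_steps intervals."""
    return [i / num_steps for i in range(num_steps + 1)]


# Assumption: placeholder 7-interval grid, NOT the values from the paper.
ILLUSTRATIVE_7_STEP_GRID = [0.0, 1 / 16, 1 / 8, 3 / 16, 1 / 4, 1 / 2, 3 / 4, 1.0]

# Toy stand-in for a trained model: a straight-line flow towards TARGET.
TARGET = [1.0, -2.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]


pruned = euler_sample(toy_velocity, [0.0, 0.0], ILLUSTRATIVE_7_STEP_GRID)
dense = euler_sample(toy_velocity, [0.0, 0.0], uniform_schedule(32))
print("7-step result:", pruned)
print("32-step result:", dense)
```

The toy velocity field is a straight line, so any valid grid reaches the target and the example says nothing about speech quality. It only shows that the solver loop is unchanged and that the cost scales with the number of intervals in the grid.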
Problem

Research questions and friction points this paper is trying to address.

Reducing sampling steps in flow-matching TTS models
Improving inference speed without performance loss
Generalizing step pruning across different TTS models
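To make the inference-speed figure concrete, the short sketch below works through the real-time factor (RTF), which is processing time divided by the duration of the generated audio. The function names are hypothetical, and the 10-second utterance is an assumed example. The baseline value is only what the stated "4 times faster" would imply, not a number reported in the abstract.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Processing time needed to generate audio_seconds of speech at a given RTF."""
    return rtf * audio_seconds


REPORTED_RTF = 0.030      # from the abstract (7-step generation, RTX 3090)
STATED_SPEEDUP = 4.0      # "4 times faster" from the abstract
UTTERANCE_SECONDS = 10.0  # assumption: example utterance length

fast_seconds = synthesis_time(REPORTED_RTF, UTTERANCE_SECONDS)
# Derived from the stated speedup, not a figure reported in the abstract.
implied_baseline_rtf = REPORTED_RTF * STATED_SPEEDUP
baseline_seconds = synthesis_time(implied_baseline_rtf, UTTERANCE_SECONDS)

print(f"7-step time for {UTTERANCE_SECONDS:.0f} s of audio: {fast_seconds:.2f} s")
print(f"Implied baseline time: {baseline_seconds:.2f} s")
```

An RTF below 1 means speech is generated faster than it plays back, so at 0.030 a 10-second utterance needs roughly 0.3 seconds of processing.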
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prunes redundant steps via empirical analysis
Non-uniform step sampling for efficiency
Maintains performance with fewer steps
Qixi Zheng
Shanghai Jiao Tong University
voice conversion, text-to-speech, diffusion models, flow matching
Yushen Chen
Shanghai Jiao Tong University
Speech and Language Processing
Zhikang Niu
Shanghai Jiao Tong University
Speech Synthesis
Ziyang Ma
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Xiaofei Wang
Microsoft, USA
Kai Yu
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Xie Chen
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China; Shanghai Innovation Institute, China