🤖 AI Summary
Flow-matching text-to-speech (TTS) models suffer from slow inference because they rely on multi-step iterative sampling. To address this, we propose Empirically Pruned Step Sampling (EPSS), a training-free, model-agnostic non-uniform sampling strategy that identifies and skips redundant timesteps by analyzing forward sampling trajectories, with no retraining required. EPSS significantly accelerates inference while preserving speech quality: applied to F5-TTS, it generates high-fidelity speech in only 7 steps, achieving a real-time factor (RTF) of 0.030 on an RTX 3090, 4× faster than standard sampling with no perceptible quality degradation, and it generalizes well to other architectures such as E2 TTS. To our knowledge, EPSS is the first trajectory-analysis-based, universally applicable, and efficient sampling paradigm for flow-matching TTS, offering a practical remedy for inference latency without architectural or training modifications.
📝 Abstract
Flow-matching-based text-to-speech (TTS) models, such as Voicebox, E2 TTS, and F5-TTS, have attracted significant attention in recent years. These models require multiple sampling steps to reconstruct speech from noise, making inference speed a key challenge. Reducing the number of sampling steps can greatly improve inference efficiency. To this end, we introduce Fast F5-TTS, a training-free approach to accelerate the inference of flow-matching-based TTS models. By inspecting the sampling trajectory of F5-TTS, we identify redundant steps and propose Empirically Pruned Step Sampling (EPSS), a non-uniform time-step sampling strategy that effectively reduces the number of sampling steps. Our approach achieves a 7-step generation with an inference RTF of 0.030 on an NVIDIA RTX 3090 GPU, making it 4 times faster than the original F5-TTS while maintaining comparable performance. Furthermore, EPSS performs well on E2 TTS models, demonstrating its strong generalization ability.
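The core idea, replacing a uniform timestep grid with a shorter, empirically chosen non-uniform schedule in the ODE sampler, can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the `pruned` schedule values are hypothetical stand-ins for a trajectory-derived schedule, and a toy constant velocity field stands in for the learned flow-matching model.

```python
import numpy as np

def euler_sample(velocity_fn, x_init, timesteps):
    """Explicit Euler integration of the flow-matching ODE dx/dt = v(x, t)
    over an arbitrary (possibly non-uniform) timestep schedule."""
    x = np.array(x_init, dtype=float)
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        x = x + (t_next - t_cur) * velocity_fn(x, t_cur)
    return x

# Toy straight-line flow from noise toward a fixed target: the velocity
# field is constant, so any schedule spanning [0, 1] integrates exactly
# to the target. A real TTS model's curved trajectory is where a pruned
# schedule trades steps for accuracy.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
target = np.ones(4)
v = lambda x, t: target - x0

# Uniform baseline: 32 steps, evenly spaced on [0, 1].
uniform = np.linspace(0.0, 1.0, 33)

# Hypothetical pruned 7-step schedule in the spirit of EPSS: dense early
# steps where the trajectory changes quickly, coarse strides later where
# steps are empirically redundant. Values are illustrative only.
pruned = np.array([0.0, 0.02, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0])

out_uniform = euler_sample(v, x0, uniform)
out_pruned = euler_sample(v, x0, pruned)
```

Both schedules reach the same endpoint here because the toy field is constant; with a learned model, the pruned schedule instead approximates the full trajectory with far fewer network evaluations, which is the source of the reported speedup.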