🤖 AI Summary
Diffusion models for speech enhancement suffer from high inference latency due to multi-step iterative sampling, hindering real-time deployment.
Method: This paper proposes an efficient flow-matching framework trained in a single stage. It conditions the learned velocity field on the target step size, yielding a step-invariant model: inference uses deterministic ODE solvers with target-step conditioning, so the same weights support any step count without architectural modifications or fine-tuning.
Contribution/Results: The approach achieves real-time inference with a real-time factor of only 0.013 on consumer-grade GPUs using a single step, while matching the perceptual quality of conventional 60-step diffusion models. It effectively breaks the long-standing trade-off between high-fidelity speech enhancement and ultra-low latency, enabling practical real-time applications without compromising audio quality.
📝 Abstract
Diffusion-based generative models have achieved state-of-the-art performance for perceptual quality in speech enhancement (SE). However, their iterative nature requires numerous neural function evaluations (NFEs), posing a challenge for real-time applications. In contrast, flow matching offers a more efficient alternative by learning a direct vector field, enabling high-quality synthesis in just a few steps using deterministic ordinary differential equation (ODE) solvers. We thus introduce Shortcut Flow Matching for Speech Enhancement (SFMSE), a novel approach that trains a single, step-invariant model. By conditioning the velocity field on the target time step during a one-stage training process, SFMSE can perform single, few, or multi-step denoising without any architectural changes or fine-tuning. Our results demonstrate that single-step SFMSE inference achieves a real-time factor (RTF) of 0.013 on a consumer GPU while delivering perceptual quality comparable to a strong diffusion baseline requiring 60 NFEs. This work also provides an empirical analysis of the role of stochasticity in training and inference, bridging the gap between high-quality generative SE and low-latency constraints.
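The core idea above (a velocity field conditioned on the target step size, integrated with a deterministic Euler ODE solver) can be illustrated with a minimal sketch. This is not the paper's actual SFMSE implementation; the network architecture, tensor shapes, and all names here are illustrative assumptions, showing only how one set of weights can serve 1-step and 60-step inference:

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Toy velocity network v(x, t, d), conditioned on time t and step size d.

    Illustrative stand-in for a step-invariant flow-matching model; the
    real SFMSE network operates on speech features, not a plain MLP.
    """
    def __init__(self, dim: int = 16, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 2, hidden),  # +2 for the scalar (t, d) conditioning
            nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t, d):
        # Broadcast the scalar conditioning (t, d) across the batch.
        cond = torch.stack([t, d], dim=-1).expand(x.shape[0], 2)
        return self.net(torch.cat([x, cond], dim=-1))

@torch.no_grad()
def sample(model, x_noisy, n_steps: int):
    """Deterministic Euler ODE integration from t=0 to t=1.

    Because the field sees the step size d, the same trained weights can be
    run with n_steps = 1 or 60 without architectural changes or fine-tuning.
    """
    d = 1.0 / n_steps
    x = x_noisy
    for k in range(n_steps):
        t = torch.tensor(k * d)
        x = x + d * model(x, t, torch.tensor(d))
    return x

model = VelocityField()
x0 = torch.randn(8, 16)          # stand-in for noisy speech features
out_1 = sample(model, x0, 1)     # single-step inference
out_60 = sample(model, x0, 60)   # multi-step inference, same model
```

The single-step call is one forward pass of the network, which is what makes the reported RTF of 0.013 plausible relative to a 60-NFE diffusion sampler.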