Shortcut Flow Matching for Speech Enhancement: Step-Invariant Flows via Single-Stage Training

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models for speech enhancement suffer from high inference latency due to multi-step iterative sampling, hindering real-time deployment. Method: This paper proposes an efficient, single-stage trained flow-matching framework that introduces a time-conditioned velocity field learning scheme, enabling a step-invariant flow-matching model. The method employs deterministic ODE solvers with target-step conditioning, eliminating the need for architectural modifications or fine-tuning across varying step counts. Contribution/Results: The approach achieves real-time inference with a real-time factor of only 0.013 on consumer-grade GPUs using a single step, while matching the perceptual quality of conventional 60-step diffusion models. It effectively breaks the long-standing trade-off between high-fidelity speech enhancement and ultra-low latency, enabling practical real-time applications without compromising audio quality.
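The summary's key mechanism — a velocity field conditioned on both the current time and the target step size, integrated with a deterministic Euler ODE solver — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the closed-form `velocity` function below stands in for the trained network, and the names `velocity` and `enhance` are assumptions for the sketch.

```python
def velocity(x, t, d):
    """Toy stand-in for a learned, step-conditioned velocity field
    v_theta(x, t, d). SFMSE conditions the network on the current
    time t AND the target step size d, so one model serves any step
    budget. This closed-form field just transports x toward a fixed
    target of 1.0 -- purely illustrative, not the trained network."""
    target = 1.0
    return (target - x) / max(1.0 - t, d)

def enhance(x_noisy, n_steps):
    """Deterministic Euler ODE integration from t=0 to t=1 in
    n_steps, i.e. n_steps neural function evaluations (NFEs)."""
    x = x_noisy
    d = 1.0 / n_steps                    # uniform step size, fed to the model
    for k in range(n_steps):
        t = k * d
        x = x + d * velocity(x, t, d)    # one Euler update per NFE
    return x

print(enhance(0.0, 1))   # single-step inference (1 NFE): 1.0
print(enhance(0.0, 8))   # same model, 8-step inference: also 1.0
```

The point of the step-size conditioning is visible in the loop: the same `velocity` callable is reused whether the budget is 1 step or 8, with no retraining or architectural change between the two calls.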

📝 Abstract
Diffusion-based generative models have achieved state-of-the-art performance for perceptual quality in speech enhancement (SE). However, their iterative nature requires numerous Neural Function Evaluations (NFEs), posing a challenge for real-time applications. In contrast, flow matching offers a more efficient alternative by learning a direct vector field, enabling high-quality synthesis in just a few steps using deterministic ordinary differential equation (ODE) solvers. We thus introduce Shortcut Flow Matching for Speech Enhancement (SFMSE), a novel approach that trains a single, step-invariant model. By conditioning the velocity field on the target time step during a one-stage training process, SFMSE can perform single, few, or multi-step denoising without any architectural changes or fine-tuning. Our results demonstrate that a single-step SFMSE inference achieves a real-time factor (RTF) of 0.013 on a consumer GPU while delivering perceptual quality comparable to a strong diffusion baseline requiring 60 NFEs. This work also provides an empirical analysis of the role of stochasticity in training and inference, bridging the gap between high-quality generative SE and low-latency constraints.
Problem

Research questions and friction points this paper is trying to address.

Achieving real-time speech enhancement with diffusion models
Reducing Neural Function Evaluations for low-latency applications
Maintaining perceptual quality while enabling flexible denoising steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single step-invariant model training
Time-conditioned velocity field for denoising
Single-step inference achieving real-time performance
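The "single step-invariant model training" bullet presumably follows the shortcut-model recipe: alongside a standard flow-matching loss, the model is trained so that one step of size 2d matches the composition of two consecutive steps of size d. The sketch below illustrates that self-consistency target; the function name, the frozen-copy detail, and the toy field are assumptions for illustration, not taken from the paper.

```python
def self_consistency_target(v, x, t, d):
    """Shortcut-style target: one step of size 2*d should equal the
    composition of two consecutive d-steps. Returns the average
    velocity of the two small steps, onto which v(x, t, 2*d) would be
    regressed. In training, v would be a frozen copy of the network
    (an assumption, following the shortcut-model recipe)."""
    v1 = v(x, t, d)              # velocity for the first small step
    x_mid = x + d * v1           # Euler-advance to time t + d
    v2 = v(x_mid, t + d, d)      # velocity for the second small step
    return 0.5 * (v1 + v2)       # equivalent velocity for one 2*d step

# Toy check with a simple contracting field (hypothetical, for the demo):
v_lin = lambda x, t, d: 1.0 - x
tgt = self_consistency_target(v_lin, 0.0, 0.0, 0.25)

# One 2d-step using the target velocity matches two explicit d-steps:
one_big = 0.0 + 0.5 * tgt                  # single step of size 0.5
two_small = 0.25 + 0.25 * (1.0 - 0.25)     # two Euler steps of size 0.25
```

Training against such a target is what lets a single model remain accurate across step counts, avoiding the separate distillation or fine-tuning stage that few-step diffusion methods typically need.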
👥 Authors
Naisong Zhou
EPFL, Lausanne, CH
Saisamarth Rajesh Phaye
Logitech, Lausanne, CH
Milos Cernak
Logitech, EPFL - Quartier de l'Innovation
Meeting Speech · Speech Analysis-Synthesis and Coding · Pathological Speech Processing · Artificial Intelligence
Tijana Stojkovic
Logitech, Lausanne, CH
Andy Pearce
Logitech, Lausanne, CH
Andrea Cavallaro
Director, Idiap Research Institute; Professor, EPFL
Machine Learning · Computer Vision · Audio Processing · Robot Perception · Privacy
Andy Harper
Logitech, Lausanne, CH