Shortcut Flow Matching for Speech Enhancement: Step-Invariant Flows via Single-Stage Training

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models for speech enhancement suffer from high inference latency due to multi-step iterative sampling, hindering real-time deployment. Method: This paper proposes an efficient, single-stage trained flow-matching framework that introduces a time-conditioned velocity field learning scheme, enabling a step-invariant flow-matching model. The method employs deterministic ODE solvers with target-step conditioning, eliminating the need for architectural modifications or fine-tuning across varying step counts. Contribution/Results: The approach achieves real-time inference with a real-time factor of only 0.013 on consumer-grade GPUs using a single step, while matching the perceptual quality of conventional 60-step diffusion models. It effectively breaks the long-standing trade-off between high-fidelity speech enhancement and ultra-low latency, enabling practical real-time applications without compromising audio quality.
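The summary's key mechanism — a velocity field conditioned on both the current time and the target step size, integrated with a deterministic Euler ODE solver — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the closed-form `velocity` function below stands in for the trained network, and the names `velocity` and `enhance` are assumptions for the sketch.

```python
def velocity(x, t, d):
    """Toy stand-in for a learned, step-conditioned velocity field
    v_theta(x, t, d). SFMSE conditions the network on the current
    time t AND the target step size d, so one model serves any step
    budget. This closed-form field just transports x toward a fixed
    target of 1.0 -- purely illustrative, not the trained network."""
    target = 1.0
    return (target - x) / max(1.0 - t, d)

def enhance(x_noisy, n_steps):
    """Deterministic Euler ODE integration from t=0 to t=1 in
    n_steps, i.e. n_steps neural function evaluations (NFEs)."""
    x = x_noisy
    d = 1.0 / n_steps                    # uniform step size, fed to the model
    for k in range(n_steps):
        t = k * d
        x = x + d * velocity(x, t, d)    # one Euler update per NFE
    return x

print(enhance(0.0, 1))   # single-step inference (1 NFE): 1.0
print(enhance(0.0, 8))   # same model, 8-step inference: also 1.0
```

The point of the step-size conditioning is visible in the loop: the same `velocity` callable is reused whether the budget is 1 step or 8, with no retraining or architectural change between the two calls.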

📝 Abstract
Diffusion-based generative models have achieved state-of-the-art performance for perceptual quality in speech enhancement (SE). However, their iterative nature requires numerous Neural Function Evaluations (NFEs), posing a challenge for real-time applications. In contrast, flow matching offers a more efficient alternative by learning a direct vector field, enabling high-quality synthesis in just a few steps using deterministic ordinary differential equation (ODE) solvers. We thus introduce Shortcut Flow Matching for Speech Enhancement (SFMSE), a novel approach that trains a single, step-invariant model. By conditioning the velocity field on the target time step during a one-stage training process, SFMSE can perform single, few, or multi-step denoising without any architectural changes or fine-tuning. Our results demonstrate that a single-step SFMSE inference achieves a real-time factor (RTF) of 0.013 on a consumer GPU while delivering perceptual quality comparable to a strong diffusion baseline requiring 60 NFEs. This work also provides an empirical analysis of the role of stochasticity in training and inference, bridging the gap between high-quality generative SE and low-latency constraints.
Problem

Research questions and friction points this paper is trying to address.

Achieving real-time speech enhancement with diffusion models
Reducing Neural Function Evaluations for low-latency applications
Maintaining perceptual quality while enabling flexible denoising steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single step-invariant model training
Time-conditioned velocity field for denoising
Single-step inference achieving real-time performance
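The "single step-invariant model training" bullet presumably follows the shortcut-model recipe: alongside a standard flow-matching loss, the model is trained so that one step of size 2d matches the composition of two consecutive steps of size d. The sketch below illustrates that self-consistency target; the function name, the frozen-copy detail, and the toy field are assumptions for illustration, not taken from the paper.

```python
def self_consistency_target(v, x, t, d):
    """Shortcut-style target: one step of size 2*d should equal the
    composition of two consecutive d-steps. Returns the average
    velocity of the two small steps, onto which v(x, t, 2*d) would be
    regressed. In training, v would be a frozen copy of the network
    (an assumption, following the shortcut-model recipe)."""
    v1 = v(x, t, d)              # velocity for the first small step
    x_mid = x + d * v1           # Euler-advance to time t + d
    v2 = v(x_mid, t + d, d)      # velocity for the second small step
    return 0.5 * (v1 + v2)       # equivalent velocity for one 2*d step

# Toy check with a simple contracting field (hypothetical, for the demo):
v_lin = lambda x, t, d: 1.0 - x
tgt = self_consistency_target(v_lin, 0.0, 0.0, 0.25)

# One 2d-step using the target velocity matches two explicit d-steps:
one_big = 0.0 + 0.5 * tgt                  # single step of size 0.5
two_small = 0.25 + 0.25 * (1.0 - 0.25)     # two Euler steps of size 0.25
```

Training against such a target is what lets a single model remain accurate across step counts, avoiding the separate distillation or fine-tuning stage that few-step diffusion methods typically need.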
👥 Authors
Naisong Zhou
EPFL, Lausanne, CH
Saisamarth Rajesh Phaye
Logitech, Lausanne, CH
Milos Cernak
Logitech, EPFL - Quartier de l'Innovation
Meeting Speech · Speech Analysis-Synthesis and Coding · Pathological Speech Processing · Artificial Intelligence
Tijana Stojkovic
Logitech, Lausanne, CH
Andy Pearce
Logitech, Lausanne, CH
Andrea Cavallaro
Director, Idiap Research Institute; Professor, EPFL
Machine Learning · Computer Vision · Audio Processing · Robot Perception · Privacy
Andy Harper
Logitech, Lausanne, CH