Real-Time Streamable Generative Speech Restoration with Flow Matching

📅 2025-12-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion-based generative models are computationally expensive because they call a large DNN many times, which has kept them out of low-latency real-time voice communication. This paper presents Stream.FM, a frame-causal, flow-based generative model for streaming speech restoration. Its contributions are a buffered streaming inference scheme, an optimized DNN architecture, learned few-step numerical solvers that improve output quality at a fixed compute budget, and a study of model weight compression along the compute/quality tradeoff. Stream.FM has an algorithmic latency of 32 ms and a total latency of 48 ms, and a variant for speech enhancement reaches a total latency of 24 ms. It supports speech enhancement, dereverberation, codec post-filtering, bandwidth extension, STFT phase retrieval, and Mel vocoding. Objective evaluations and a MUSHRA listening test indicate state-of-the-art quality among generative streaming speech restoration methods, with only a moderate quality loss relative to a non-streaming variant, and better results than the authors' earlier Diffusion Buffer on streaming speech enhancement at a lower latency. The model runs in real time on consumer GPUs.

📝 Abstract

Diffusion-based generative models have greatly impacted the speech processing field in recent years, exhibiting high speech naturalness and spawning a new research direction. Their application in real-time communication is, however, still lagging behind due to their computation-heavy nature involving multiple calls of large DNNs. Here, we present Stream.FM, a frame-causal flow-based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms, paving the way for generative speech processing in real-time communication. We propose a buffered streaming inference scheme and an optimized DNN architecture, show how learned few-step numerical solvers can boost output quality at a fixed compute budget, explore model weight compression to find favorable points along a compute/quality tradeoff, and contribute a model variant with 24 ms total latency for the speech enhancement task. Our work looks beyond theoretical latencies, showing that high-quality streaming generative speech processing can be realized on consumer GPUs available today. Stream.FM can solve a variety of speech processing tasks in a streaming fashion: speech enhancement, dereverberation, codec post-filtering, bandwidth extension, STFT phase retrieval, and Mel vocoding. As we verify through comprehensive evaluations and a MUSHRA listening test, Stream.FM establishes a state-of-the-art for generative streaming speech restoration, exhibits only a reasonable reduction in quality compared to a non-streaming variant, and outperforms our recent work (Diffusion Buffer) on generative streaming speech enhancement while operating at a lower latency.
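To make "frame-causal" concrete, here is a minimal sketch of streaming processing with a rolling buffer of past frames. It is a generic illustration, not the buffered inference scheme of Stream.FM, which the abstract does not detail. The names `StreamingRestorer`, `causal_mean_model`, and `run_stream` are hypothetical, and the "model" is a toy average standing in for the DNN.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Frame = List[float]


def causal_mean_model(context: Sequence[Frame]) -> Frame:
    """Toy stand-in for a DNN (assumption): element-wise mean of the buffered frames."""
    n = len(context)
    return [sum(frame[i] for frame in context) / n for i in range(len(context[0]))]


class StreamingRestorer:
    """Processes one frame at a time using only the current and past frames."""

    def __init__(self, model: Callable[[Sequence[Frame]], Frame], context_frames: int):
        self.model = model
        # Rolling buffer: the oldest frame is dropped once the buffer is full.
        self.buffer: Deque[Frame] = deque(maxlen=context_frames)

    def push(self, frame: Frame) -> Frame:
        """Append the newest frame and emit one output frame immediately."""
        self.buffer.append(list(frame))
        return self.model(list(self.buffer))


def run_stream(frames: Sequence[Frame], context_frames: int = 3) -> List[Frame]:
    """Feed frames in arrival order and collect the per-frame outputs."""
    restorer = StreamingRestorer(causal_mean_model, context_frames)
    return [restorer.push(frame) for frame in frames]
```

Because each output depends only on frames that have already arrived, truncating the input stream never changes the outputs emitted so far. That property is what lets a system of this kind emit audio with bounded latency instead of waiting for the whole utterance.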

Problem

Research questions and friction points this paper is trying to address.

Making generative speech restoration fast enough for real-time, low-latency use

Supporting streaming speech processing tasks such as enhancement and dereverberation

Balancing output quality against compute cost so the model runs on consumer GPUs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Buffered streaming inference with 32 ms algorithmic latency

Optimized DNN architecture and model weight compression

Learned few-step numerical solvers that raise output quality at a fixed compute budget
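The link between solver steps and compute can be sketched with a plain Euler integrator for a flow ODE dx/dt = v(x, t) on t in [0, 1], where every step costs one evaluation of the velocity function (one DNN call in a real model). This is an illustration under stated assumptions, not the paper's solver: `euler_solve`, `uniform_grid`, and `toy_velocity` are hypothetical names, the velocity field is a toy with a known exact solution, and the time grid is fixed by hand, whereas a learned solver would tune such solver parameters from data.

```python
import math
from typing import Callable, List, Sequence

State = List[float]


def euler_solve(
    velocity: Callable[[State, float], State],
    x0: State,
    time_grid: Sequence[float],
) -> State:
    """Integrate dx/dt = velocity(x, t) along time_grid with explicit Euler steps."""
    x = list(x0)
    for t0, t1 in zip(time_grid[:-1], time_grid[1:]):
        v = velocity(x, t0)  # one velocity evaluation per step
        x = [xi + (t1 - t0) * vi for xi, vi in zip(x, v)]
    return x


def uniform_grid(num_steps: int) -> List[float]:
    """Evenly spaced time points from 0 to 1; a non-uniform grid works as well."""
    return [k / num_steps for k in range(num_steps + 1)]


def toy_velocity(x: State, t: float) -> State:
    """Toy field (assumption): dx/dt = -x, so the exact solution is x(1) = x(0) * exp(-1)."""
    return [-xi for xi in x]


if __name__ == "__main__":
    exact = math.exp(-1.0)
    for steps in (1, 2, 4, 8):
        approx = euler_solve(toy_velocity, [1.0], uniform_grid(steps))[0]
        print(steps, "steps, absolute error:", abs(approx - exact))
```

On this toy problem the error shrinks as the step count grows, while the cost grows linearly with it. A few-step solver works at the low end of that range, which is why the choice of time grid and update rule matters for quality at a fixed compute budget.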

🔎 Similar Papers

High-Resolution Speech Restoration with Latent Diffusion Model