MeanFlowSE: One-Step Generative Speech Enhancement via MeanFlow

📅 2025-09-27
🤖 AI Summary
Existing generative speech enhancement methods—such as diffusion models and flow matching—rely on multi-step sampling or large-scale architectures, resulting in slow inference and poor deployability. To address this, we propose MeanFlowSE: the first method to introduce MeanFlow into speech enhancement, establishing a single-step latent-space optimization framework that generates enhanced speech in one pass by predicting the mean velocity field. Crucially, it replaces conventional VAE latent variables with self-supervised learning (SSL)-derived representations, providing robust semantic guidance. Evaluated on the DNS Challenge blind test set, MeanFlowSE achieves state-of-the-art perceptual quality (P.808 MOS of 4.32), with a real-time factor of only 0.12 and fewer than 5 million parameters. It significantly outperforms existing generative approaches while maintaining high intelligibility and naturalness. The method thus offers an unprecedented combination of efficiency, lightweight design, and practical deployability.

📝 Abstract
Speech enhancement (SE) recovers clean speech from noisy signals and is vital for applications such as telecommunications and automatic speech recognition (ASR). While generative approaches achieve strong perceptual quality, they often rely on multi-step sampling (diffusion/flow matching) or large language models, limiting real-time deployment. To mitigate these constraints, we present MeanFlowSE, a one-step generative SE framework. It adopts MeanFlow to predict an average-velocity field for one-step latent refinement and conditions the model on self-supervised learning (SSL) representations rather than VAE latents. This design accelerates inference and provides robust acoustic-semantic guidance during training. On the Interspeech 2020 DNS Challenge blind and simulated test sets, MeanFlowSE attains state-of-the-art (SOTA) perceptual quality and competitive intelligibility while substantially lowering both the real-time factor (RTF) and model size compared with recent generative competitors, making it suitable for practical deployment. The code will be released upon publication at https://github.com/Hello3orld/MeanFlowSE.
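The one-step sampling idea behind MeanFlow can be illustrated with a minimal sketch. MeanFlow trains a network to predict the average velocity over an interval, u(z_t, r, t) = 1/(t − r) ∫ᵣᵗ v(z_τ, τ) dτ, so a single evaluation jumps the whole way: z_r = z_t − (t − r)·u(z_t, r, t). The sketch below is not the paper's code: the `mean_velocity` function is a hypothetical stand-in (here the closed-form average velocity of a straight-line flow), where the paper would use a neural network conditioned on SSL features of the noisy speech.

```python
import numpy as np

# Toy illustration of MeanFlow-style one-step sampling (hedged sketch,
# not the paper's implementation). For a linear flow
#   z_t = (1 - t) * z0 + t * z1,
# the instantaneous velocity is v = z1 - z0 everywhere, so the average
# velocity over any interval is also z1 - z0.

rng = np.random.default_rng(0)
z0 = rng.standard_normal(4)  # "clean" latent (target end of the flow)
z1 = rng.standard_normal(4)  # "noisy" latent (source end of the flow)

def mean_velocity(z_t, r, t):
    # Hypothetical stand-in for the learned network u_theta(z, r, t);
    # MeanFlowSE would condition this prediction on SSL representations.
    return z1 - z0

# One-step generation: jump from t = 1 (noisy) to r = 0 (clean)
# with a single network evaluation: z_r = z_t - (t - r) * u_theta.
z_hat = z1 - (1.0 - 0.0) * mean_velocity(z1, r=0.0, t=1.0)

print("one-step recovery error:", np.abs(z_hat - z0).max())
```

Because the stand-in velocity is exact for the linear flow, the single step recovers z0 exactly; with a learned u_theta the same update performs one-pass enhancement, which is what removes the multi-step sampling cost of diffusion and flow-matching baselines.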
Problem

Research questions and friction points this paper is trying to address.

One-step generative speech enhancement for real-time deployment
Improving perceptual quality and intelligibility of noisy speech
Reducing computational cost and model size for practicality
Innovation

Methods, ideas, or system contributions that make the work stand out.

One-step generative speech enhancement via MeanFlow
Uses self-supervised learning representations for conditioning
Achieves real-time efficiency with reduced model size
Yike Zhu
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, Xi’an, China
Boyi Kang
The Hong Kong University of Science and Technology
Ziqian Wang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, Xi’an, China
Xingchen Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, Xi’an, China
Zihan Zhang
Huawei Technologies, China
Wenjie Li
Huawei Technologies, China
Longshuai Xiao
Huawei Technologies, China
Wei Xue
The Hong Kong University of Science and Technology, Hong Kong, China
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, Xi’an, China