Target matching based generative model for speech enhancement

📅 2025-09-09
🤖 AI Summary
Generative models for speech enhancement can suffer from hallucination artifacts caused by stochastic components in the vector field, from unstable or inefficient training and inference, and from the high computational complexity of the widely used NCSN++ backbone. To address these issues, this paper proposes a target-matching generative speech enhancement framework. It reformulates generation as a deterministic target-signal estimation task, which removes the stochastic components from the training loss; it adopts a logistic mean schedule and a bridge variance schedule to obtain a more favorable SNR trajectory; and it designs a lightweight, audio-specific diffusion backbone that explicitly models long-range inter-frame dependencies and cross-band coupling. Experiments are reported to show lower computational overhead, improved speech fidelity and enhancement quality, and better results than existing diffusion-based baselines in mitigating hallucinations, speeding up inference, and improving objective metrics such as PESQ and STOI.

📝 Abstract
The design of mean and variance schedules for the perturbed signal is a fundamental challenge in generative models. While score-based and Schrödinger bridge-based models require careful selection of the stochastic differential equation to derive the corresponding schedules, flow-based models address this issue via vector field matching. However, this strategy often leads to hallucination artifacts and inefficient training and inference processes due to the potential inclusion of stochastic components in the vector field. Additionally, the widely adopted diffusion backbone, NCSN++, suffers from high computational complexity. To overcome these limitations, we propose a novel target-based generative framework that enhances both the flexibility of mean/variance schedule design and the efficiency of training and inference processes. Specifically, we eliminate the stochastic components in the training loss by reformulating the generative speech enhancement task as a target signal estimation problem, which leads to more stable and efficient training and inference processes. In addition, we employ a logistic mean schedule and a bridge variance schedule, which yield a more favorable signal-to-noise ratio trajectory than several widely used schedules and thus lead to a more efficient perturbation strategy. Furthermore, we propose a new diffusion backbone for audio, which significantly improves efficiency over NCSN++ by explicitly modeling long-term frame correlations and cross-band dependencies.
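To make the schedule idea concrete, the following is a minimal sketch and not the paper's formulation. This page does not give the schedule equations, so every specific form here is an assumption: a logistic weight `m(t)` rescaled to run from 0 to 1, a Brownian-bridge-style standard deviation `sigma_max * sqrt(t * (1 - t))`, and the convention that t = 0 is clean speech and t = 1 is the noisy recording. The names `logistic_mean_weight`, `bridge_std`, `perturb`, `steepness`, and `sigma_max` are hypothetical.

```python
import math


def logistic_mean_weight(t, steepness=10.0):
    """Assumed logistic weight m(t), rescaled so that m(0) = 0 and m(1) = 1."""
    def sigmoid(u):
        return 1.0 / (1.0 + math.exp(-u))

    lo = sigmoid(-0.5 * steepness)
    hi = sigmoid(0.5 * steepness)
    return (sigmoid(steepness * (t - 0.5)) - lo) / (hi - lo)


def bridge_std(t, sigma_max=0.5):
    """Assumed bridge-style standard deviation: zero at t = 0 and t = 1."""
    return sigma_max * math.sqrt(t * (1.0 - t))


def perturb(clean, noisy, t, noise):
    """Perturbed signal: logistic blend of clean and noisy plus scaled noise.

    `clean`, `noisy`, and `noise` are equal-length lists of floats.
    """
    m = logistic_mean_weight(t)
    s = bridge_std(t)
    return [(1.0 - m) * c + m * y + s * z
            for c, y, z in zip(clean, noisy, noise)]


if __name__ == "__main__":
    # Print the assumed mean weight and noise level along the trajectory.
    for step in range(0, 11, 2):
        t = step / 10.0
        print(f"t={t:.1f}  m(t)={logistic_mean_weight(t):.3f}  "
              f"std(t)={bridge_std(t):.3f}")
```

Under these assumed forms, the perturbed signal equals the clean signal at t = 0 and the noisy signal at t = 1, and the injected noise is largest at the midpoint of the trajectory.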
Problem

Research questions and friction points this paper is trying to address.

Designing mean and variance schedules for perturbed signals in generative models
Addressing hallucination artifacts and inefficient training in flow-based models
Reducing computational complexity of diffusion backbones for audio enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Target-based generative framework for speech enhancement
Logistic mean and bridge variance schedules
Efficient audio diffusion backbone that models long-term frame correlations and cross-band dependencies
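The target-based idea can be illustrated with a small sketch, under stated assumptions. The abstract says only that the training loss is reformulated as target signal estimation so that it contains no stochastic components; the mean-squared-error form, the function name `target_matching_loss`, and the `predict(perturbed, noisy, t)` signature below are hypothetical choices for illustration, not the paper's definitions.

```python
def target_matching_loss(predict, clean, perturbed, noisy, t):
    """Assumed loss: mean squared error between the estimate and clean speech.

    The regression target is the clean signal itself, so no sampled noise
    term appears in the target (random noise enters only through `perturbed`).
    `predict` is any callable mapping (perturbed, noisy, t) to an estimate.
    """
    estimate = predict(perturbed, noisy, t)
    return sum((e - c) ** 2 for e, c in zip(estimate, clean)) / len(clean)


if __name__ == "__main__":
    clean = [0.0, 0.0]
    noisy = [0.5, -0.5]
    perturbed = [1.0, 2.0]

    # Toy predictor that returns its perturbed input unchanged.
    identity = lambda x, y, t: x
    print(target_matching_loss(identity, clean, perturbed, noisy, 0.5))
```

In a real system `predict` would be the neural backbone; the toy identity predictor here only shows how the loss is evaluated.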
Taihui Wang
Institute of Acoustics, Chinese Academy of Sciences
statistical signal processing, blind source separation, speech dereverberation
Rilin Chen
Tencent AI Lab, Beijing 100193, China and Tencent Multimodal Models Department, Beijing 100193, China
Tong Lei
Tencent AI Lab, Shenzhen, China
Andong Li
Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China
Jinzheng Zhao
Tencent AI Lab, Beijing 100193, China and Tencent Multimodal Models Department, Beijing 100193, China
Meng Yu
Tencent AI Lab, Bellevue, WA, USA
Dong Yu
Tencent AI Lab, Bellevue, WA, USA