Dual-View Predictive Diffusion: Lightweight Speech Enhancement via Spectrogram-Image Synergy

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of existing diffusion models for speech enhancement, which treat spectrograms as generic images and overlook their inherently sparse spectral structure. To overcome this limitation, the authors propose DVPD, a lightweight Dual-View Predictive Diffusion model that jointly captures both the visual texture and the physical spectral characteristics of spectrograms. The key innovations are a Frequency-Adaptive Non-uniform Compression (FANC) encoder, a Lightweight Image-based Spectro-Awareness (LISA) module, and a Training-free Lossless Boost (TLB) strategy applied at inference. Evaluated across multiple benchmarks, DVPD achieves state-of-the-art performance with only 35% of the parameters and 40% of the inference MACs of PGUSE, striking an exceptional balance between high-fidelity enhancement and extreme model efficiency.

📝 Abstract
Diffusion models have recently set new benchmarks in Speech Enhancement (SE). However, most existing score-based models treat speech spectrograms merely as generic 2D images, applying uniform processing that ignores the intrinsic structural sparsity of audio, which results in inefficient spectral representation and prohibitive computational complexity. To bridge this gap, we propose DVPD, an extremely lightweight Dual-View Predictive Diffusion model, which uniquely exploits the dual nature of spectrograms as both visual textures and physical frequency-domain representations across both training and inference stages. Specifically, during training, we optimize spectral utilization via the Frequency-Adaptive Non-uniform Compression (FANC) encoder, which preserves critical low-frequency harmonics while pruning high-frequency redundancies. Simultaneously, we introduce a Lightweight Image-based Spectro-Awareness (LISA) module to capture features from a visual perspective with minimal overhead. During inference, we propose a Training-free Lossless Boost (TLB) strategy that leverages the same dual-view priors to refine generation quality without any additional fine-tuning. Extensive experiments across various benchmarks demonstrate that DVPD achieves state-of-the-art performance while requiring only 35% of the parameters and 40% of the inference MACs of the SOTA lightweight model PGUSE. These results highlight DVPD's superior ability to balance high-fidelity speech quality with extreme architectural efficiency. Code and audio samples are available at the anonymous website: {https://anonymous.4open.science/r/dvpd_demo-E630}
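The abstract does not spell out FANC's internals, but the stated idea — preserving critical low-frequency harmonics while pruning high-frequency redundancy — can be sketched as non-uniform pooling over frequency bins. Everything below (the split point, the pool size, the function name) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def fanc_compress(spec, split_bin=128, pool=4):
    """Hypothetical sketch of frequency-adaptive non-uniform compression:
    keep low-frequency bins at full resolution and average-pool the
    high-frequency bins. Parameter names and values are illustrative."""
    low = spec[:split_bin]                 # preserve low-frequency harmonics
    high = spec[split_bin:]
    n = (high.shape[0] // pool) * pool     # trim to a multiple of the pool size
    pooled = high[:n].reshape(-1, pool, high.shape[1]).mean(axis=1)
    return np.concatenate([low, pooled], axis=0)

# Toy magnitude spectrogram: 257 frequency bins x 100 frames.
spec = np.abs(np.random.randn(257, 100))
out = fanc_compress(spec)
print(out.shape)  # (160, 100): 128 kept bins + 32 pooled bins
```

The non-uniform split reflects the structural sparsity the paper targets: speech energy concentrates in the lower bands, so uniform treatment of all bins (as in generic image diffusion) wastes capacity on largely redundant high frequencies.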
Problem

Research questions and friction points this paper is trying to address.

Speech Enhancement
Diffusion Models
Spectrogram Representation
Computational Complexity
Structural Sparsity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-View Predictive Diffusion
Frequency-Adaptive Non-uniform Compression
Lightweight Image-based Spectro-Awareness
Training-free Lossless Boost
Speech Enhancement
Ke Xue
Nanjing University
Black-Box Optimization, Machine Learning
Rongfei Fan
Beijing Institute of Technology
Federated Learning, Edge Computing, Resource Allocation, Statistical Signal Processing
Kai Li
Department of Computer Science and Technology, Institute for AI, BNRist, Tsinghua University, Beijing 100084, China
Shanping Yu
School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing 100081, China
Puning Zhao
School of Cyberspace Science and Technology, Sun Yat-sen University, Guangzhou 510006, China
Jianping An
School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing 100081, China