🤖 AI Summary
Speech enhancement for resource-constrained devices demands a careful trade-off between model efficiency and performance. To address this, we propose IMSE, an ultra-lightweight U-Net architecture. First, we introduce Magnitude-Aware Linear Attention (MALA), a novel attention mechanism that explicitly incorporates magnitude information to enable efficient global spectral modeling. Second, we design Inception Depthwise Convolution (IDConv), which replaces computationally expensive large-kernel convolutions and captures multi-scale spectral features with minimal parameter overhead. Evaluated on the VoiceBank+DEMAND corpus, IMSE uses only 0.427M parameters (16.8% fewer than MUSE) while attaining a PESQ score of 3.373, matching state-of-the-art performance. The architecture significantly improves computational efficiency and deployment feasibility on edge devices without compromising perceptual quality.
📝 Abstract
Achieving a balance between lightweight design and high performance remains a significant challenge for speech enhancement (SE) tasks on resource-constrained devices. Existing state-of-the-art methods, such as MUSE, have established a strong baseline with only 0.51M parameters by introducing a Multi-path Enhanced Taylor (MET) transformer and Deformable Embedding (DE). However, an in-depth analysis reveals that MUSE still suffers from efficiency bottlenecks: the MET module relies on a complex "approximate-compensate" mechanism to mitigate the limitations of Taylor-expansion-based attention, while the offset calculation for deformable embedding introduces additional computational burden. This paper proposes IMSE, a systematically optimized and ultra-lightweight network. We introduce two core innovations: 1) Replacing the MET module with Magnitude-Aware Linear Attention (MALA). MALA fundamentally rectifies the "magnitude-ignoring" problem in linear attention by explicitly preserving the norm information of query vectors in the attention calculation, achieving efficient global modeling without an auxiliary compensation branch. 2) Replacing the DE module with Inception Depthwise Convolution (IDConv). IDConv borrows the Inception concept, decomposing large-kernel operations into efficient parallel branches (square, horizontal, and vertical strips), thereby capturing spectrogram features with extremely low parameter redundancy. Extensive experiments on the VoiceBank+DEMAND dataset demonstrate that, compared to the MUSE baseline, IMSE significantly reduces the parameter count by 16.8% (from 0.513M to 0.427M) while achieving performance comparable to the state of the art on the PESQ metric (3.373). This study sets a new benchmark for the trade-off between model size and speech quality in ultra-lightweight speech enhancement.
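To make the MALA idea concrete: kernelized linear attention replaces softmax(QKᵀ) with a feature map φ(·), which makes the cost linear in sequence length but discards each query's norm. The abstract's fix is to re-inject that norm into the attention output. Below is a minimal NumPy sketch of this pattern; the choice of feature map and the exact form of the magnitude re-injection are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mala_linear_attention(q, k, v, eps=1e-6):
    """Sketch of magnitude-aware linear attention (hypothetical form).

    q, k: (n, d) query/key matrices; v: (n, d_v) values.
    Cost is O(n * d * d_v) instead of O(n^2) because keys and values
    are aggregated once before being combined with the queries.
    """
    phi = lambda x: np.maximum(x, 0.0) + eps          # simple positive feature map (assumption)
    q_norm = np.linalg.norm(q, axis=-1, keepdims=True)  # per-query magnitude, normally lost by phi
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                                      # (d, d_v): global key-value summary
    z = qf @ kf.sum(axis=0, keepdims=True).T           # (n, 1): per-query normalizer
    out = (qf @ kv) / (z + eps)                        # standard linear attention output
    # Re-inject query magnitudes as a relative per-token scale (assumed form).
    return out * (q_norm / (q_norm.mean() + eps))
```

Each row of the output is still a globally pooled mixture of the values, but tokens with larger query norms (e.g., high-energy spectral frames) now contribute with proportionally larger weight, without any auxiliary compensation branch.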
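The parameter saving claimed for IDConv follows directly from how a depthwise convolution scales with kernel size. A quick back-of-the-envelope comparison, with hypothetical channel counts and kernel sizes (the paper's actual configuration is not given here), shows why splitting channels across small square and strip-shaped branches is far cheaper than one large square kernel:

```python
def depthwise_params(channels, kh, kw):
    """Parameter count of a depthwise conv: one kh x kw filter per channel."""
    return channels * kh * kw

C, K = 64, 11  # hypothetical channel count and large kernel size
large = depthwise_params(C, K, K)  # single 11x11 depthwise conv

# Inception-style split (assumed): a quarter of the channels per spatial
# branch, covering square, horizontal-strip, and vertical-strip receptive fields.
branch_c = C // 4
idconv = (depthwise_params(branch_c, 3, 3)    # small square branch
          + depthwise_params(branch_c, 1, K)  # 1 x K horizontal strip
          + depthwise_params(branch_c, K, 1)) # K x 1 vertical strip

print(large, idconv)  # 7744 vs 496
```

The strip branches still span the full K-wide receptive field along one axis (time or frequency in a spectrogram), which is why multi-scale context survives the decomposition even though the parameter count drops by more than an order of magnitude.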