IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speech enhancement for resource-constrained devices demands a careful trade-off between model efficiency and performance. To address this, we propose IMSE, an ultra-lightweight U-Net architecture. First, we introduce Amplitude-Aware Linear Attention (MALA), a novel attention mechanism that explicitly incorporates query-magnitude information to enable efficient global spectral modeling. Second, we design Inception Depthwise Convolution (IDConv), which captures multi-scale spectral features with minimal parameter overhead, replacing computationally expensive large-kernel convolutions. Evaluated on the VoiceBank+DEMAND corpus, IMSE uses only 0.427M parameters (16.8% fewer than MUSE) while attaining a PESQ score of 3.373, matching state-of-the-art performance. The architecture significantly improves computational efficiency and deployment feasibility on edge devices without compromising perceptual quality.

📝 Abstract
Achieving a balance between lightweight design and high performance remains a significant challenge for speech enhancement (SE) tasks on resource-constrained devices. Existing state-of-the-art methods, such as MUSE, have established a strong baseline with only 0.51M parameters by introducing a Multi-path Enhanced Taylor (MET) transformer and Deformable Embedding (DE). However, an in-depth analysis reveals that MUSE still suffers from efficiency bottlenecks: the MET module relies on a complex "approximate-compensate" mechanism to mitigate the limitations of Taylor-expansion-based attention, while the offset calculation for deformable embedding introduces additional computational burden. This paper proposes IMSE, a systematically optimized and ultra-lightweight network. We introduce two core innovations: 1) Replacing the MET module with Amplitude-Aware Linear Attention (MALA). MALA fundamentally rectifies the "amplitude-ignoring" problem in linear attention by explicitly preserving the norm information of query vectors in the attention calculation, achieving efficient global modeling without an auxiliary compensation branch. 2) Replacing the DE module with Inception Depthwise Convolution (IDConv). IDConv borrows the Inception concept, decomposing large-kernel operations into efficient parallel branches (square, horizontal, and vertical strips), thereby capturing spectrogram features with extremely low parameter redundancy. Extensive experiments on the VoiceBank+DEMAND dataset demonstrate that, compared to the MUSE baseline, IMSE significantly reduces the parameter count by 16.8% (from 0.513M to 0.427M) while achieving competitive performance comparable to the state-of-the-art on the PESQ metric (3.373). This study sets a new benchmark for the trade-off between model size and speech quality in ultra-lightweight speech enhancement.
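The abstract describes MALA only at a high level: standard kernelized linear attention normalizes away the query's magnitude, and MALA rectifies this "amplitude-ignoring" problem by keeping the query-norm information in the attention calculation. The paper's exact formulation is not given on this page, so the following is a minimal illustrative sketch, assuming a ReLU feature map and reintroducing the query's L2 norm as a multiplicative scale on the attended output; function and variable names are placeholders, not the paper's API.

```python
import numpy as np

def amplitude_aware_linear_attention(Q, K, V, eps=1e-6):
    """Illustrative sketch of amplitude-aware linear attention.

    Q, K: (T, d) query/key matrices; V: (T, d_v) value matrix.
    Plain linear attention computes phi(Q) @ (phi(K).T @ V) in O(T * d^2),
    avoiding the O(T^2) score matrix; here we additionally save the query
    magnitudes and multiply them back in at the end (the hedged
    "amplitude-aware" step).
    """
    phi = lambda x: np.maximum(x, 0.0) + eps        # non-negative feature map
    q_norm = np.linalg.norm(Q, axis=-1, keepdims=True)  # (T, 1) magnitudes

    Qf = phi(Q)
    Qf = Qf / np.linalg.norm(Qf, axis=-1, keepdims=True)  # direction only
    Kf = phi(K)

    KV = Kf.T @ V                                   # (d, d_v), shared summary
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T        # (T, 1) normalizer
    out = (Qf @ KV) / (Z + eps)                     # magnitude-blind attention
    return q_norm * out                             # restore query amplitude

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = rng.standard_normal((3, 8, 4))
    y = amplitude_aware_linear_attention(Q, K, V)
    print(y.shape)  # (8, 4)
```

Because the magnitude enters as a separate scale, doubling a query row doubles that row's output while leaving the attention weights (the direction-based mixing of values) essentially unchanged, which is one simple way to avoid the auxiliary compensation branch the abstract attributes to MET.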
Problem

Research questions and friction points this paper is trying to address.

Balancing lightweight design with high performance for speech enhancement on resource-constrained devices
Addressing efficiency bottlenecks in existing methods like MUSE's complex attention mechanisms
Reducing computational burden while maintaining competitive speech quality metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Amplitude-Aware Linear Attention preserves query norm
Inception Depthwise Convolution replaces deformable embedding
Ultra-lightweight U-Net reduces parameters by 16.8 percent
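The abstract says IDConv decomposes a large depthwise kernel into parallel square, horizontal-strip, and vertical-strip branches. The concrete kernel sizes and channel split are not given on this page, so the sketch below is an assumption-laden illustration in plain NumPy: channels are split into three groups, each convolved depthwise with one branch's (placeholder, unlearned) kernel, then concatenated. `depthwise_conv2d`, the 3x3 square size, and the 11-wide strips are all hypothetical choices, not the paper's.

```python
import numpy as np

def depthwise_conv2d(x, kh, kw):
    """Same-padded depthwise conv over x of shape (C, H, W).

    Uses a uniform averaging kernel as a stand-in for learned weights,
    purely to show shapes and cost; a real layer would learn per-channel
    kernels.
    """
    C, H, W = x.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    k = np.full((kh, kw), 1.0 / (kh * kw))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + kh, j:j + kw] * k)
    return out

def idconv(x, square=3, band=11):
    """Inception-style depthwise conv sketch: split channels across
    a square branch, a horizontal strip, and a vertical strip."""
    C = x.shape[0]
    g = C // 3
    parts = [
        depthwise_conv2d(x[:g], square, square),  # square k x k
        depthwise_conv2d(x[g:2 * g], 1, band),    # horizontal strip 1 x k
        depthwise_conv2d(x[2 * g:], band, 1),     # vertical strip k x 1
    ]
    return np.concatenate(parts, axis=0)

if __name__ == "__main__":
    x = np.random.default_rng(1).standard_normal((9, 12, 12))
    print(idconv(x).shape)  # (9, 12, 12)
```

The parameter saving is easy to see per channel: a full K x K depthwise kernel needs K^2 weights, while a 1 x K plus K x 1 strip pair needs only 2K (e.g. 121 vs. 22 for K = 11), which is the low-redundancy multi-scale capture the Innovation bullets refer to.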