🤖 AI Summary
In single-channel speech enhancement, achieving both a low parameter count and high performance remains challenging. This paper proposes LORT, a lightweight U-Net architecture that jointly models magnitude and phase spectra via Local Refinement Convolution (LRC) and a Spatial-Channel Enhanced Taylor Transformer, which incorporates Taylor-based Multi-Head Self-Attention (T-MSA) and Spatial-Channel Enhancement Attention (SCEA). To improve noise robustness, LORT integrates gated feed-forward networks, a multi-scale discriminator, and a composite loss function. Evaluated on the VCTK+DEMAND and DNS Challenge datasets, LORT achieves state-of-the-art or competitive performance with only 0.96M parameters, substantially reducing computational overhead compared to mainstream models while maintaining strong generalization and practical deployability.
📝 Abstract
Achieving superior enhancement performance while maintaining a low parameter count and computational complexity remains a challenge in the field of speech enhancement. In this paper, we introduce LORT, a novel architecture that integrates a spatial-channel enhanced Taylor Transformer and locally refined convolution for efficient and robust speech enhancement. We propose a Taylor multi-head self-attention (T-MSA) module enhanced with spatial-channel enhancement attention (SCEA), designed to facilitate inter-channel information exchange and alleviate the spatial attention limitations inherent in Taylor-based Transformers. To complement global modeling, we further present a locally refined convolution (LRC) block that integrates convolutional feed-forward layers, time-frequency dense local convolutions, and gated units to capture fine-grained local details. Built upon a U-Net-like encoder-decoder structure with only 16 output channels in the encoder, LORT processes noisy inputs through multi-resolution T-MSA modules using alternating downsampling and upsampling operations. The enhanced magnitude and phase spectra are decoded independently and optimized through a composite loss function that jointly considers magnitude, complex, phase, discriminator, and consistency objectives. Experimental results on the VCTK+DEMAND and DNS Challenge datasets demonstrate that LORT achieves competitive or superior performance to state-of-the-art (SOTA) models with only 0.96M parameters, highlighting its effectiveness for real-world speech enhancement applications with limited computational resources.
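The Taylor-based attention the abstract refers to rests on approximating the softmax's exponential with its first-order Taylor expansion, exp(x) ≈ 1 + x. With scores of the form 1 + q·k, the key-value product can be computed before multiplying by the queries, reducing attention cost from quadratic to linear in the sequence length. A minimal single-head NumPy sketch of this idea (the function name, L2 normalization, and shapes are illustrative assumptions, not the paper's exact T-MSA):

```python
import numpy as np

def taylor_attention(Q, K, V):
    """First-order Taylor approximation of softmax attention.

    Expanding exp(q.k) ~ 1 + q.k lets K^T V be computed first,
    so cost drops from O(N^2 d) to O(N d^2) for sequence length N.
    Q, K, V: arrays of shape (N, d).
    """
    N, d = Q.shape
    # Normalize so |q.k| <= 1, keeping scores 1 + q.k non-negative.
    Q = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    K = K / np.linalg.norm(K, axis=-1, keepdims=True)
    # Numerator: sum_j (1 + q_i.k_j) v_j = sum_j v_j + q_i (K^T V)
    kv = K.T @ V                       # (d, d), computed once
    v_sum = V.sum(axis=0)              # (d,)
    numer = v_sum[None, :] + Q @ kv    # (N, d)
    # Denominator: sum_j (1 + q_i.k_j) = N + q_i (K^T 1)
    k_sum = K.sum(axis=0)              # (d,)
    denom = N + Q @ k_sum              # (N,)
    return numer / denom[:, None]
```

Because the (d, d) matrix `kv` is shared by all queries, the per-frame cost no longer grows with the number of time-frequency frames, which is what makes this family of Transformers attractive for lightweight models.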
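The abstract names five training objectives: magnitude, complex, phase, discriminator, and consistency. A hedged sketch of how such a composite loss is commonly assembled for magnitude-plus-phase enhancement; the weights, the LSGAN-style discriminator term, and the anti-wrapping cosine phase distance are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def composite_loss(mag_hat, mag, ri_hat, ri, phase_hat, phase,
                   ri_consistent, disc_score,
                   w=(0.9, 0.1, 0.3, 0.1, 0.05)):
    """Weighted sum of the five objectives named in the abstract.

    All weights in `w` are hypothetical placeholders. `ri_*` are
    stacked real/imaginary spectra; `ri_consistent` is the spectrum
    re-derived from the resynthesized waveform (STFT of the iSTFT),
    and `disc_score` is the discriminator's quality estimate in [0, 1].
    """
    l_mag = np.mean((mag_hat - mag) ** 2)        # magnitude loss
    l_ri = np.mean((ri_hat - ri) ** 2)           # complex (real/imag) loss
    # Anti-wrapping phase distance: 1 - cos keeps the loss 2*pi-periodic.
    l_phase = np.mean(1.0 - np.cos(phase_hat - phase))
    # Consistency between the predicted spectrum and its resynthesis.
    l_cons = np.mean((ri_hat - ri_consistent) ** 2)
    # LSGAN-style generator term pushing the discriminator score toward 1.
    l_disc = np.mean((1.0 - disc_score) ** 2)
    terms = (l_mag, l_ri, l_phase, l_cons, l_disc)
    return sum(wi * ti for wi, ti in zip(w, terms))
```

A loss of this shape is zero only when every term is satisfied at once, which is how the individual objectives (spectral accuracy, phase continuity, perceptual score, STFT consistency) are traded off during training.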