LORT: Locally Refined Convolution and Taylor Transformer for Monaural Speech Enhancement

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
In single-channel speech enhancement, achieving both a low parameter count and high performance remains challenging. This paper proposes LORT, a lightweight U-Net architecture that jointly models magnitude and phase spectra via a locally refined convolution (LRC) block and a spatial-channel enhanced Taylor Transformer, which combines Taylor-based multi-head self-attention (T-MSA) with spatial-channel enhancement attention (SCEA). To improve noise robustness, LORT further employs gated feed-forward networks, a multi-scale discriminator, and a composite loss function. On the VCTK+DEMAND and DNS Challenge datasets, LORT achieves state-of-the-art or competitive performance with only 0.96M parameters, substantially reducing computational overhead relative to mainstream models while retaining strong generalization and practical deployability.
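The composite loss mentioned above combines magnitude, complex, phase, discriminator, and consistency objectives. A minimal NumPy sketch of the first three spectral terms is given below; the weights are hypothetical (the paper's actual weights are not stated here), and the discriminator and consistency terms are omitted for brevity. The phase term uses an anti-wrapping distance, a common choice for phase losses.

```python
import numpy as np

def composite_spectral_loss(mag_est, mag_ref, cplx_est, cplx_ref,
                            phase_est, phase_ref, w=(0.9, 0.1, 0.3)):
    """Sketch of a magnitude + complex + phase loss.
    Weights `w` are hypothetical; discriminator and consistency
    terms from the paper are not modeled here."""
    # Magnitude loss: MSE on magnitude spectra
    l_mag = np.mean((mag_est - mag_ref) ** 2)
    # Complex loss: MSE on real/imaginary parts of the complex spectra
    l_cplx = np.mean(np.abs(cplx_est - cplx_ref) ** 2)
    # Anti-wrapping phase distance: maps the difference into (-pi, pi]
    l_phase = np.mean(np.abs(np.angle(np.exp(1j * (phase_est - phase_ref)))))
    return w[0] * l_mag + w[1] * l_cplx + w[2] * l_phase
```

Training would add the adversarial (multi-scale discriminator) term and a magnitude-phase consistency term on top of this spectral objective.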

📝 Abstract
Achieving superior enhancement performance while maintaining a low parameter count and computational complexity remains a challenge in the field of speech enhancement. In this paper, we introduce LORT, a novel architecture that integrates spatial-channel enhanced Taylor Transformer and locally refined convolution for efficient and robust speech enhancement. We propose a Taylor multi-head self-attention (T-MSA) module enhanced with spatial-channel enhancement attention (SCEA), designed to facilitate inter-channel information exchange and alleviate the spatial attention limitations inherent in Taylor-based Transformers. To complement global modeling, we further present a locally refined convolution (LRC) block that integrates convolutional feed-forward layers, time-frequency dense local convolutions, and gated units to capture fine-grained local details. Built upon a U-Net-like encoder-decoder structure with only 16 output channels in the encoder, LORT processes noisy inputs through multi-resolution T-MSA modules using alternating downsampling and upsampling operations. The enhanced magnitude and phase spectra are decoded independently and optimized through a composite loss function that jointly considers magnitude, complex, phase, discriminator, and consistency objectives. Experimental results on the VCTK+DEMAND and DNS Challenge datasets demonstrate that LORT achieves competitive or superior performance to state-of-the-art (SOTA) models with only 0.96M parameters, highlighting its effectiveness for real-world speech enhancement applications with limited computational resources.
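The efficiency of T-MSA comes from replacing the softmax in attention with a Taylor expansion, which lets the key-value product be aggregated once instead of per query. A minimal NumPy sketch of the general idea, using a first-order expansion exp(s) ≈ 1 + s (the paper's exact expansion order, normalization, and the SCEA branch are not reproduced here):

```python
import numpy as np

def taylor_attention(Q, K, V):
    """Linear-complexity attention from a first-order Taylor expansion
    of exp(q.k) ~ 1 + q.k. Illustrative sketch, not the paper's T-MSA."""
    N, d = K.shape
    Q = Q / np.sqrt(d)            # usual attention scaling before expansion
    kv = K.T @ V                  # (d, d): shared key-value summary, computed once
    k_sum = K.sum(axis=0)         # (d,)
    v_sum = V.sum(axis=0)         # (d,)
    num = v_sum + Q @ kv          # numerator: sum_j (1 + q_i.k_j) v_j
    den = N + Q @ k_sum           # denominator: sum_j (1 + q_i.k_j)
    return num / den[:, None]
```

Because `K.T @ V` and the sums are reused by every query, the cost is O(N d^2) rather than the O(N^2 d) of softmax attention. Practical implementations typically add safeguards (e.g. positive feature maps) so the denominator cannot vanish; this sketch omits them.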
Problem

Research questions and friction points this paper is trying to address.

Achieving superior speech enhancement with low computational complexity
Overcoming spatial attention limitations in Taylor-based Transformers
Capturing fine-grained local details while maintaining global modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates spatial-channel enhanced Taylor Transformer for attention
Uses locally refined convolution blocks for local details
Employs U-Net encoder-decoder with multi-resolution T-MSA modules