🤖 AI Summary
To address real-time speech enhancement on resource-constrained edge devices, this paper proposes a lightweight U-Net architecture integrated with a reverse soft attention mechanism. The design keeps the parameter count small while strengthening the modeling of salient speech features, and it enables low-latency, end-to-end inference with GPU-friendly computation. Evaluated on standard benchmarks, the proposed method achieves a 0.64 PESQ improvement and a 6.24% reduction in word error rate (WER) over un-enhanced speech, while outperforming similarly sized baseline models, with corresponding gains in speech intelligibility and subjective quality. The core contribution lies in embedding reverse attention into the encoder-decoder pathways of the lightweight U-Net, balancing model compactness and enhancement performance. This yields an efficient, practical solution for real-time speech enhancement at the edge.
📝 Abstract
This paper introduces a lightweight deep learning model for real-time speech enhancement, designed to operate efficiently on resource-constrained devices. The proposed model leverages a compact architecture that facilitates rapid inference without compromising performance. Key contributions include incorporating soft attention gates into the U-Net architecture, which is known to perform well for segmentation tasks, and optimizing the design for GPU execution. Experimental evaluations demonstrate that the model achieves competitive speech quality and intelligibility metrics, such as PESQ and word error rate (WER), improving on similarly sized baseline models. We achieve a 6.24% WER improvement and a 0.64 PESQ score improvement over un-enhanced waveforms.
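The abstract does not spell out the exact gating formulation, so as a rough illustration only: a soft attention gate in a U-Net typically re-weights encoder skip-connection features with a mask computed from a decoder-side gating signal (in the style of additive attention gates). All variable names, shapes, and the additive-fusion form below are assumptions for the sketch, not the paper's actual architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_attention_gate(x, g, Wx, Wg, psi):
    """Illustrative additive soft attention gate for a U-Net skip connection.

    x        : encoder skip features, shape (C, T)  -- assumed layout
    g        : decoder gating features, shape (C, T)
    Wx, Wg   : (C_int, C) projection matrices (hypothetical parameters)
    psi      : (1, C_int) projection producing one mask value per frame
    """
    # Fuse the two feature maps in a shared intermediate space.
    q = np.maximum(Wx @ x + Wg @ g, 0.0)   # ReLU over the additive fusion
    alpha = sigmoid(psi @ q)               # (1, T) soft mask in (0, 1)
    return x * alpha                       # attenuate or pass skip features

# Toy usage with random features: C=4 channels, T=8 frames, C_int=3.
rng = np.random.default_rng(0)
C, T, C_int = 4, 8, 3
x = rng.standard_normal((C, T))
g = rng.standard_normal((C, T))
Wx = rng.standard_normal((C_int, C))
Wg = rng.standard_normal((C_int, C))
psi = rng.standard_normal((1, C_int))
out = soft_attention_gate(x, g, Wx, Wg, psi)
assert out.shape == x.shape
```

Because the mask lies in (0, 1), the gate can only scale skip features down, which is one way such gates suppress noise-dominated regions before the decoder merges the skip connection.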