Lightweight DNN for Full-Band Speech Denoising on Mobile Devices: Exploiting Long and Short Temporal Patterns

📅 2025-09-05
🤖 AI Summary
This paper targets mobile full-band speech enhancement, where low latency, a lightweight model, and high fidelity must be achieved together, and proposes a causal, real-time deep neural network architecture. The method combines look-back frames, latency-controllable temporal convolutions, lightweight recurrent modeling (e.g., GRU), and inverted bottleneck structures, and uses causal instance normalization to keep streaming inference stable. The network takes STFT magnitude spectra as input and uses a modified U-Net to jointly model short-term and long-term temporal patterns. Evaluated on mainstream smartphones, the system reports end-to-end processing latency below 20 ms and a real-time factor below 0.02. It outperforms existing low-latency full-band methods in SI-SDR and is reported to attain state-of-the-art performance on public benchmarks including DNS-Challenge, indicating that it is effective and practical under resource-constrained conditions.
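The real-time factor quoted above is a compute-throughput ratio (processing time divided by the duration of the audio processed), which is a different quantity from latency. A minimal sketch of the arithmetic follows; the function names and the 10 ms hop size are assumptions chosen for illustration and do not come from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Ratio of compute time to the duration of the audio that was processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def compute_budget_per_hop_ms(rtf: float, hop_ms: float) -> float:
    """Compute time per frame hop (in ms) implied by a given real-time factor."""
    return rtf * hop_ms


# Hypothetical numbers: the 10 ms hop is an assumption, not a value from the paper.
budget_ms = compute_budget_per_hop_ms(rtf=0.02, hop_ms=10.0)
print(f"Compute budget per 10 ms hop at RTF 0.02: {budget_ms:.2f} ms")
```

Under that assumed hop, an RTF of 0.02 leaves the model a fraction of a millisecond of compute per frame, with the remainder of each hop free for other work on the device.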

📝 Abstract
Speech denoising (SD) is an important task in many, if not all, modern signal processing chains used in devices and everyday applications. While there are many published and powerful deep neural network (DNN)-based methods for SD, few are optimized for resource-constrained platforms such as mobile devices. Additionally, most DNN-based methods for SD do not focus on full-band (FB) signals, i.e., signals with a 48 kHz sampling rate, and/or low-latency use cases. In this paper, we present a causal, low-latency, and lightweight DNN-based method for full-band SD, leveraging both short and long temporal patterns. The method is based on a modified UNet architecture employing look-back frames, temporal spanning of convolutional kernels, and recurrent neural networks for exploiting short and long temporal patterns in the signal and estimated denoising mask. The DNN operates on a causal frame-by-frame basis taking the STFT magnitude as input, utilizes inverted bottlenecks inspired by MobileNet, employs causal instance normalization for channel-wise normalization, and achieves a real-time factor below 0.02 when deployed on a modern mobile phone. The proposed method is evaluated using established speech denoising metrics and publicly available datasets, demonstrating its effectiveness in achieving an (SI-)SDR value that outperforms existing FB and low-latency SD methods.
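For readers unfamiliar with the SI-SDR metric mentioned in the abstract, the sketch below implements its standard definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual, in dB. It is written in plain Python for readability, skips the mean removal that some implementations apply first, and is not the evaluation code used in the paper; the `eps` guard is an assumption added for numerical safety.

```python
import math
from typing import Sequence


def si_sdr(reference: Sequence[float], estimate: Sequence[float],
           eps: float = 1e-12) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (standard definition)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    scale = dot / (ref_energy + eps)
    # Split the estimate into a scaled-target part and a residual part.
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is the property that distinguishes SI-SDR from plain SDR.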
Problem

Research questions and friction points this paper is trying to address.

Lightweight DNN for full-band speech denoising on mobile devices
Optimizing denoising for resource-constrained platforms with low latency
Exploiting both short and long temporal patterns in signals
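One way to picture short-term context in a causal, frame-by-frame system is a buffer that pairs each incoming STFT magnitude frame with a fixed number of past frames. The sketch below is an illustrative assumption about how such look-back frames could be gathered; the class name `LookBackBuffer`, the zero-padding at stream start, and the buffer layout are hypothetical and not taken from the paper.

```python
from collections import deque
from typing import Deque, List, Sequence


class LookBackBuffer:
    """Keeps the current frame plus `look_back` past frames (hypothetical helper)."""

    def __init__(self, num_bins: int, look_back: int) -> None:
        self.num_bins = num_bins
        # Start with zero frames so the first real frame already has full context.
        self.frames: Deque[List[float]] = deque(
            ([0.0] * num_bins for _ in range(look_back + 1)),
            maxlen=look_back + 1,
        )

    def push(self, frame: Sequence[float]) -> List[List[float]]:
        """Add the newest frame and return context ordered oldest to newest."""
        if len(frame) != self.num_bins:
            raise ValueError("frame has the wrong number of frequency bins")
        self.frames.append(list(frame))  # the oldest frame drops out automatically
        return [list(f) for f in self.frames]
```

Only past and current frames ever enter the buffer, so the context it returns never depends on future input, which is what keeps the processing causal.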
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight DNN optimized for mobile devices
Causal UNet with short and long temporal patterns
Inverted bottlenecks and causal instance normalization
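The text here does not spell out how causal instance normalization is formulated. As a rough illustration only, one common streaming approach normalizes each frame with statistics accumulated from the frames seen so far, so that no future input is needed. The sketch below handles a single channel under that assumption; the class name, the cumulative-statistics choice, and the `eps` value are all hypothetical and may differ from the paper's method.

```python
import math
from typing import List, Sequence


class CausalInstanceNorm:
    """Single-channel normalization using statistics of past and current frames."""

    def __init__(self, eps: float = 1e-5) -> None:
        self.eps = eps
        self.count = 0        # number of values seen so far
        self.total = 0.0      # running sum of values
        self.total_sq = 0.0   # running sum of squared values

    def step(self, frame: Sequence[float]) -> List[float]:
        """Update the running statistics with `frame`, then normalize it."""
        if len(frame) == 0:
            raise ValueError("frame must not be empty")
        self.count += len(frame)
        self.total += sum(frame)
        self.total_sq += sum(x * x for x in frame)
        mean = self.total / self.count
        # Clamp at zero to guard against small negative values from rounding.
        var = max(self.total_sq / self.count - mean * mean, 0.0)
        inv_std = 1.0 / math.sqrt(var + self.eps)
        return [(x - mean) * inv_std for x in frame]
```

Since the statistics only ever grow with past and current frames, the output for a given frame is fixed once it is produced, in contrast to standard instance normalization, which needs the whole utterance.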
Authors

Konstantinos Drossos
Principal Scientist, Audio Machine Learning at Nokia Technologies
multimodal learning, audio processing, machine listening
Mikko Heikkinen
Nokia Technologies, Tampere, Finland
Paschalis Tsiaflakis
Nokia Bell Labs, Antwerp, Belgium