Modulating State Space Model with SlowFast Framework for Compute-Efficient Ultra Low-Latency Speech Enhancement

📅 2024-11-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of deep learning models under ultra-low-latency (≤2 ms) speech enhancement constraints, this paper proposes a SlowFast dual-branch architecture. The slow branch models long-term acoustic context at a low frame rate, while the fast branch performs time-domain enhancement with a state space model (SSM) whose state-transition process is dynamically modulated by the slow branch, effectively decoupling computational load from processing latency. On the Voice Bank + Demand dataset with a 2 ms algorithmic latency requirement, the method reduces computation cost by 70% compared to a single-branch baseline with equivalent parameters, without degrading enhancement quality. A further network built on the framework achieves point-wise enhancement with an algorithmic latency of just 62.5 μs (one sample at a 16 kHz sample rate) at only 100 M MACs/s, while scoring a PESQ-NB of 3.12 and an SI-SNR of 16.62.

📝 Abstract
Deep learning-based speech enhancement (SE) methods often face significant computational challenges when needing to meet low-latency requirements because of the increased number of frames to be processed. This paper introduces the SlowFast framework which aims to reduce computation costs specifically when low-latency enhancement is needed. The framework consists of a slow branch that analyzes the acoustic environment at a low frame rate, and a fast branch that performs SE in the time domain at the needed higher frame rate to match the required latency. Specifically, the fast branch employs a state space model where its state transition process is dynamically modulated by the slow branch. Experiments on a SE task with a 2 ms algorithmic latency requirement using the Voice Bank + Demand dataset show that our approach reduces computation cost by 70% compared to a baseline single-branch network with equivalent parameters, without compromising enhancement performance. Furthermore, by leveraging the SlowFast framework, we implemented a network that achieves an algorithmic latency of just 62.5 μs (one sample point at 16 kHz sample rate) with a computation cost of 100 M MACs/s, while scoring a PESQ-NB of 3.12 and SISNR of 16.62.
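The core mechanism described in the abstract can be sketched as a per-sample diagonal SSM recurrence whose state-transition coefficients are rescaled by a slow branch that runs at a much lower frame rate. The following is a minimal illustrative sketch, not the paper's implementation: the dimensions, the `slow_branch` function, the residual output, and all parameter values are assumptions chosen only to make the modulation idea concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper)
state_dim = 8      # SSM state size
slow_hop = 32      # slow branch updates once every 32 samples

# Diagonal SSM parameters: baseline decay plus input/output projections
a_base = np.full(state_dim, 0.95)          # baseline state-transition values
b = rng.standard_normal(state_dim) * 0.1   # input projection
c = rng.standard_normal(state_dim) * 0.1   # output projection

def slow_branch(frame):
    """Hypothetical stand-in for the slow branch: maps a low-rate context
    frame to a multiplicative modulation of the state transition."""
    return 1.0 / (1.0 + np.exp(-frame.mean()))  # sigmoid, in (0, 1)

def enhance(x):
    """Point-wise enhancement with a slow-branch-modulated diagonal SSM."""
    h = np.zeros(state_dim)
    y = np.empty_like(x)
    mod = 1.0
    for t, sample in enumerate(x):
        if t % slow_hop == 0:  # slow branch fires at a low frame rate
            frame = x[max(0, t - slow_hop):t + 1]
            mod = slow_branch(frame)
        a_t = a_base * mod           # dynamically modulated transition
        h = a_t * h + b * sample     # state update: one sample of latency
        y[t] = c @ h + sample        # residual (skip-connected) output
    return y

noisy = rng.standard_normal(1000)
clean_est = enhance(noisy)
print(clean_est.shape)  # (1000,)
```

Because the slow branch recomputes its modulation only every `slow_hop` samples while the fast recurrence costs a handful of multiply-accumulates per sample, most of the per-second compute budget is paid at the low frame rate, which is the decoupling of compute from latency that the abstract claims.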
Problem

Research questions and friction points this paper is trying to address.

Real-time Processing
Deep Learning
Speech Enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

SlowFast framework
real-time speech enhancement
computational efficiency
Longbiao Cheng
Institute of Neuroinformatics, University of Zurich and ETH Zurich
compute-efficient neural networks · audio signal processing · deep learning
Ashutosh Pandey
Reality Labs Research, Meta, Redmond, United States
Buye Xu
Meta Reality Labs Research
T. Delbruck
Institute of Neuroinformatics, University of Zurich and ETH Zurich, Zurich, Switzerland
V. Ithapu
Reality Labs Research, Meta, Redmond, United States
Shih-Chii Liu
Institute of Neuroinformatics, University of Zurich & ETH Zurich
Spiking neuromorphic sensors · event-driven deep learning · neuromorphic computing · BM interfaces