An Investigation of Incorporating Mamba For Speech Enhancement

📅 2024-05-10
🏛️ Spoken Language Technology Workshop
📈 Citations: 59
✨ Influential: 10

🤖 AI Summary
This work investigates Mamba, an attention-free, scalable state-space model (SSM), for regression-based speech enhancement (SE), and proposes the SEMamba architecture in basic, advanced, causal, and non-causal configurations. Training considers both signal-level and metric-oriented losses. On the VoiceBank-DEMAND benchmark, the advanced non-causal SEMamba attains a PESQ of 3.55, and a new state-of-the-art PESQ of 3.69 when combined with Perceptual Contrast Stretching (PCS). Compared with Transformer-based equivalents, it reduces FLOPs by up to approximately 12%. As an ASR front-end, it performs competitively with recent SE solutions. The results support the use of SSMs for temporal modeling in speech enhancement.

📝 Abstract
This work aims to investigate the use of a recently proposed, attention-free, scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task. In particular, we employ Mamba to deploy different regression-based SE models (SEMamba) with different configurations, namely basic, advanced, causal, and non-causal. Furthermore, loss functions that are either based on signal-level distances or metric-oriented are considered. Experimental evidence shows that SEMamba attains a competitive PESQ of 3.55 on the VoiceBank-DEMAND dataset with the advanced, non-causal configuration. A new state-of-the-art PESQ of 3.69 is also reported when SEMamba is combined with Perceptual Contrast Stretching (PCS). Compared against Transformer-based equivalent SE solutions, a noticeable FLOPs reduction of up to $\sim 12\%$ is observed with the advanced non-causal configurations. Finally, SEMamba can be used as a pre-processing step before automatic speech recognition (ASR), showing competitive performance against recent SE solutions.
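
To make the causal versus non-causal distinction in the abstract concrete, the following is a minimal sketch of a generic linear state-space recurrence, not the SEMamba implementation. It assumes fixed scalar parameters `a`, `b`, and `c` (Mamba instead makes its parameters input-dependent), and it assumes that the non-causal variant simply sums a forward and a backward scan. The function names are hypothetical.

```python
from typing import List


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Causal scan of a scalar linear state-space recurrence.

    h[t] = a * h[t-1] + b * x[t]
    y[t] = c * h[t]

    Each output depends only on the current and past inputs.
    """
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t
        y.append(c * h)
    return y


def bidirectional_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Non-causal variant (an assumption for illustration): run the same
    recurrence forward and backward in time, then sum the two outputs."""
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(x[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# A unit impulse in the middle of a short sequence.
impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
causal = ssm_scan(impulse, a=0.5, b=1.0, c=1.0)
non_causal = bidirectional_scan(impulse, a=0.5, b=1.0, c=1.0)

print(causal)      # response appears only at and after the impulse
print(non_causal)  # response spreads to both sides of the impulse
```

The causal output is zero before the impulse, which is what a streaming SE system needs, while the bidirectional output also uses future context and therefore requires the whole utterance.
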
Problem

Research questions and friction points this paper is trying to address.

Investigating Mamba state-space model for speech enhancement tasks
Developing regression-based models with causal and non-causal configurations
Reducing computational complexity while maintaining competitive enhancement performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Mamba state-space model for speech enhancement
Combines Mamba with perceptual contrast stretching
Reduces computational FLOPs compared to transformer models
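
As a rough illustration of the contrast-stretching idea behind PCS, the sketch below scales each frequency bin's log-compressed magnitude by a per-band gain and then undoes the compression. This is an assumption-laden sketch, not the PCS code used with SEMamba: the function name `stretch_frame` is hypothetical, and the gain values are made up for the example rather than taken from the paper.

```python
import math
from typing import List


def stretch_frame(magnitudes: List[float], band_gains: List[float]) -> List[float]:
    """Stretch one spectral frame in the log1p domain.

    Each magnitude m is mapped to expm1(g * log1p(m)), where g is the gain
    for that bin. A gain of 1.0 leaves the bin unchanged; a gain above 1.0
    raises every nonzero magnitude.
    """
    if len(magnitudes) != len(band_gains):
        raise ValueError("magnitudes and band_gains must have the same length")
    return [math.expm1(g * math.log1p(m)) for m, g in zip(magnitudes, band_gains)]


frame = [0.0, 1.0, 3.0]   # toy magnitude values for three bins
gains = [1.0, 1.2, 1.4]   # hypothetical per-band gains, not from the paper
stretched = stretch_frame(frame, gains)
print(stretched)
```

Because the scaling happens on a logarithmic scale, zero-magnitude bins stay at zero and the effect grows with magnitude, which is the sense in which the contrast is stretched.
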