🤖 AI Summary
Existing Mamba-based speech enhancement models (e.g., SEMamba) are limited to single-speaker scenarios and underperform in multi-talker cocktail-party environments with severe speaker overlap. To address this, we propose the first audio-visual speech enhancement model to integrate full-face spatiotemporal visual features with the Mamba architecture. Our method employs a dedicated audio-visual encoder to separately model lip dynamics and facial spatial characteristics, leverages Mamba's selective state-space modeling to efficiently capture long-range temporal dependencies in audio, and introduces a novel cross-modal fusion mechanism for end-to-end target speech separation and reconstruction. Evaluated under challenging single-channel conditions, the approach significantly improves speech intelligibility and naturalness in multi-speaker settings. On the AVSEC-4 Challenge single-channel track, it achieves state-of-the-art performance, outperforming all monaural baselines on STOI, PESQ, and UTMOS, and ranks first overall.
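To make the pipeline concrete, below is a minimal sketch of one audio-visual Mamba fusion block. It assumes the open-source `mamba_ssm` package (github.com/state-spaces/mamba); all class names, dimensions, and the simple concatenation-based fusion are hypothetical illustrations, not the paper's actual cross-modal mechanism or code.

```python
# Hypothetical sketch of an audio-visual Mamba enhancement block.
# Assumes the open-source `mamba_ssm` package; names and dimensions are illustrative.
import torch
import torch.nn as nn
from mamba_ssm import Mamba

class AVFusionMamba(nn.Module):
    def __init__(self, d_audio=256, d_visual=512, d_model=256):
        super().__init__()
        # Project per-frame full-face visual embeddings and audio features
        # into a shared dimension.
        self.visual_proj = nn.Linear(d_visual, d_model)
        self.audio_proj = nn.Linear(d_audio, d_model)
        # Simple fusion stand-in: concatenate time-aligned streams, then mix.
        self.fuse = nn.Linear(2 * d_model, d_model)
        # Selective state-space backbone for long-range temporal modeling.
        self.mamba = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        # Predict a bounded mask over audio features for the target speaker.
        self.mask_head = nn.Sequential(nn.Linear(d_model, d_audio), nn.Sigmoid())

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T, d_audio); visual_feats: (B, T, d_visual),
        # assumed pre-aligned to the audio frame rate.
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        h = self.fuse(torch.cat([a, v], dim=-1))
        h = self.mamba(h)                 # (B, T, d_model)
        mask = self.mask_head(h)          # (B, T, d_audio)
        return audio_feats * mask         # masked target-speech features
```

The mask-based output shown here is one common design for target-speaker extraction; the paper's end-to-end separation and reconstruction may differ.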
📝 Abstract
Recent Mamba-based models have shown promise in speech enhancement by efficiently modeling long-range temporal dependencies. However, models like Speech Enhancement Mamba (SEMamba) remain limited to single-speaker scenarios and struggle in complex multi-speaker environments such as the cocktail party problem. To overcome this, we introduce AVSEMamba, an audio-visual speech enhancement model that integrates full-face visual cues with a Mamba-based temporal backbone. By leveraging spatiotemporal visual information, AVSEMamba enables more accurate extraction of target speech in challenging conditions. Evaluated on the AVSEC-4 Challenge development and blind test sets, AVSEMamba outperforms other monaural baselines in speech intelligibility (STOI), perceptual quality (PESQ), and non-intrusive quality (UTMOS), and achieves **1st place** on the monaural leaderboard.
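For context on the reported metrics, STOI and PESQ are intrusive measures computed against the clean reference, while UTMOS is a non-intrusive MOS predictor. Below is a minimal scoring sketch using the open-source `pystoi` and `pesq` packages; it assumes 16 kHz audio and is not the challenge's official evaluation code.

```python
# Hypothetical metric computation with the open-source `pystoi` and `pesq`
# packages; not the AVSEC-4 official scorer.
import numpy as np
from pystoi import stoi
from pesq import pesq

FS = 16000  # assumed sample rate; wideband PESQ requires 16 kHz

def score_pair(clean: np.ndarray, enhanced: np.ndarray) -> dict:
    """Score one utterance: STOI (intelligibility) and wideband PESQ (quality)."""
    return {
        "STOI": stoi(clean, enhanced, FS, extended=False),
        "PESQ": pesq(FS, clean, enhanced, "wb"),
    }

# UTMOS needs no clean reference but requires a pretrained MOS-prediction
# model, so it is omitted from this sketch.
```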