AI Summary
Existing audio-visual speech separation (AVSS) methods predominantly adopt non-causal architectures with high computational complexity, rendering them unsuitable for real-time streaming applications. To address this, we propose Swift-Net, a lightweight, strictly causal streaming model. Swift-Net introduces the first causal transformation template, integrates a compact ResNet-based visual encoder, and employs an audio-visual feature alignment and fusion mechanism. Furthermore, it incorporates a grouped Simple Recurrent Unit (SRU) module to enable efficient cross-modal temporal modeling under strict causality constraints, ensuring low-latency online inference. Evaluated on LRS2, LRS3, and VoxCeleb2, Swift-Net significantly outperforms prior causal streaming AVSS approaches in separation accuracy while maintaining superior computational efficiency. It achieves an optimal trade-off between fidelity and inference speed, establishing a new paradigm for real-time AVSS.
Abstract
Audio-visual speech separation (AVSS) aims to extract a target speech signal from a mixed signal by leveraging both auditory and visual (lip movement) cues. However, most existing AVSS methods have complex architectures and rely on future context, operating offline, which renders them unsuitable for real-time applications. Inspired by the pipeline of RTFSNet, we propose a novel streaming AVSS model, named Swift-Net, which enhances the causal processing capabilities required for real-time applications. Swift-Net adopts a lightweight visual feature extraction module and an efficient fusion module for audio-visual integration. Additionally, Swift-Net employs grouped SRUs to integrate historical information across different feature spaces, improving the efficiency with which historical information is utilized. We further propose a causal transformation template that facilitates converting non-causal AVSS models into causal counterparts. Experiments on three standard benchmark datasets (LRS2, LRS3, and VoxCeleb2) demonstrate that, under causal conditions, Swift-Net achieves outstanding performance, highlighting the potential of this method for processing speech in complex environments.
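To make the grouped-SRU idea concrete, below is a minimal NumPy sketch, not the paper's implementation: the feature dimension is split into groups, each with its own independent SRU recurrence (equivalent to block-diagonal weights), and frames are processed one at a time so step t depends only on x_t and the previous state, satisfying the strict causality the abstract describes. All class and parameter names here (`GroupedSRU`, `dim`, `groups`) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GroupedSRU:
    """Hypothetical sketch of a grouped Simple Recurrent Unit layer.

    Each group owns an independent SRU over its slice of the feature
    dimension, so parameters and FLOPs shrink roughly by the number of
    groups. Strictly causal: frame t never sees frames t+1, t+2, ...
    """

    def __init__(self, dim, groups, seed=0):
        assert dim % groups == 0
        self.dim, self.groups = dim, groups
        d = dim // groups
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d)
        # One (W, W_f, v_f, b_f) parameter set per group; slices never mix.
        self.W = rng.uniform(-s, s, (groups, d, d))
        self.Wf = rng.uniform(-s, s, (groups, d, d))
        self.vf = rng.uniform(-s, s, (groups, d))
        self.bf = np.zeros((groups, d))

    def forward(self, x):
        """x: (T, dim) -> (T, dim), processed frame by frame (streaming)."""
        T = x.shape[0]
        xg = x.reshape(T, self.groups, -1)            # (T, G, d)
        c = np.zeros((self.groups, xg.shape[-1]))     # per-group state
        out = np.empty_like(xg)
        for t in range(T):                            # causal streaming loop
            xt = xg[t]
            x_tilde = np.einsum('gij,gj->gi', self.W, xt)
            f = sigmoid(np.einsum('gij,gj->gi', self.Wf, xt)
                        + self.vf * c + self.bf)      # forget gate
            c = f * c + (1.0 - f) * x_tilde           # state update
            out[t] = c
        return out.reshape(T, self.dim)
```

A quick way to verify causality: perturb the input at some frame t and confirm that all outputs before t are bit-identical, while outputs from t onward change.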