AI Summary
Existing audio-visual speech separation (AVSS) methods predominantly adopt non-causal architectures with high computational complexity, rendering them unsuitable for real-time streaming applications. To address this, we propose Swift-Net, a lightweight, strictly causal streaming model. Swift-Net introduces the first causal transformation template, integrates a compact ResNet-based visual encoder, and employs an audio-visual feature alignment and fusion mechanism. Furthermore, it incorporates a grouped Simple Recurrent Unit (SRU) module to enable efficient cross-modal temporal modeling under strict causality constraints, ensuring low-latency online inference. Evaluated on LRS2, LRS3, and VoxCeleb2, Swift-Net significantly outperforms prior causal streaming AVSS approaches in separation accuracy while maintaining superior computational efficiency. It achieves an optimal trade-off between fidelity and inference speed, establishing a new paradigm for real-time AVSS.
Abstract
Audio-visual speech separation (AVSS) aims to extract a target speech signal from a mixed signal by leveraging both auditory and visual (lip movement) cues. However, most existing AVSS methods have complex architectures and rely on future context, operating offline, which renders them unsuitable for real-time applications. Inspired by the pipeline of RTFSNet, we propose a novel streaming AVSS model, named Swift-Net, which enhances the causal processing capabilities required for real-time applications. Swift-Net adopts a lightweight visual feature extraction module and an efficient fusion module for audio-visual integration. Additionally, Swift-Net employs grouped SRUs to integrate historical information across different feature spaces, improving the efficiency with which historical information is utilized. We further propose a causal transformation template that facilitates converting non-causal AVSS models into causal counterparts. Experiments on three standard benchmark datasets (LRS2, LRS3, and VoxCeleb2) demonstrate that, under causal conditions, Swift-Net achieves outstanding performance, highlighting the potential of this method for processing speech in complex environments.
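To make the grouped-SRU idea concrete, below is a minimal NumPy sketch, not the paper's implementation: the feature dimension is split into groups, each with its own independent SRU recurrence (equivalent to block-diagonal weights), and frames are processed one at a time so step t depends only on x_t and the previous state, satisfying the strict causality the abstract describes. All class and parameter names here (`GroupedSRU`, `dim`, `groups`) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GroupedSRU:
    """Hypothetical sketch of a grouped Simple Recurrent Unit layer.

    Each group owns an independent SRU over its slice of the feature
    dimension, so parameters and FLOPs shrink roughly by the number of
    groups. Strictly causal: frame t never sees frames t+1, t+2, ...
    """

    def __init__(self, dim, groups, seed=0):
        assert dim % groups == 0
        self.dim, self.groups = dim, groups
        d = dim // groups
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d)
        # One (W, W_f, v_f, b_f) parameter set per group; slices never mix.
        self.W = rng.uniform(-s, s, (groups, d, d))
        self.Wf = rng.uniform(-s, s, (groups, d, d))
        self.vf = rng.uniform(-s, s, (groups, d))
        self.bf = np.zeros((groups, d))

    def forward(self, x):
        """x: (T, dim) -> (T, dim), processed frame by frame (streaming)."""
        T = x.shape[0]
        xg = x.reshape(T, self.groups, -1)            # (T, G, d)
        c = np.zeros((self.groups, xg.shape[-1]))     # per-group state
        out = np.empty_like(xg)
        for t in range(T):                            # causal streaming loop
            xt = xg[t]
            x_tilde = np.einsum('gij,gj->gi', self.W, xt)
            f = sigmoid(np.einsum('gij,gj->gi', self.Wf, xt)
                        + self.vf * c + self.bf)      # forget gate
            c = f * c + (1.0 - f) * x_tilde           # state update
            out[t] = c
        return out.reshape(T, self.dim)
```

A quick way to verify causality: perturb the input at some frame t and confirm that all outputs before t are bit-identical, while outputs from t onward change.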