Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of real-time target speech enhancement in multi-speaker scenarios under low signal-to-noise ratio (SNR) conditions, this paper proposes RAVEN, an end-to-end real-time audio-visual speech enhancement system. RAVEN provides the first open-source, CPU-only streaming implementation, fusing visual embeddings extracted from pre-trained audio-visual speech recognition (AVSR) and active speaker detection (ASD) models to guide a time-domain neural network that enhances the target speaker's speech. Unlike unimodal approaches, the joint AVSR–ASD embedding strategy significantly improves target speech intelligibility and quality under severe multi-talker interference and strong background noise. The system operates end-to-end in real time with low latency and is fully open-sourced, including code and demonstration videos.

📝 Abstract
Speech enhancement in audio-only settings remains challenging, particularly in the presence of interfering speakers. This paper presents a simple yet effective real-time audio-visual speech enhancement (AVSE) system, RAVEN, which isolates and enhances the on-screen target speaker while suppressing interfering speakers and background noise. We investigate how visual embeddings learned from audio-visual speech recognition (AVSR) and active speaker detection (ASD) contribute to AVSE across different SNR conditions and numbers of interfering speakers. Our results show concatenating embeddings from AVSR and ASD models provides the greatest improvement in low-SNR, multi-speaker environments, while AVSR embeddings alone perform best in noise-only scenarios. In addition, we develop a real-time streaming system that operates on a computer CPU and we provide a video demonstration and code repository. To our knowledge, this is the first open-source implementation of a real-time AVSE system.
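The fusion strategy described in the abstract is plain concatenation of the per-frame AVSR and ASD visual embeddings into a single conditioning vector. A minimal sketch of that step, with hypothetical embedding dimensions (the actual sizes are not stated here):

```python
import numpy as np

def fuse_visual_embeddings(avsr_emb, asd_emb):
    """Concatenate per-frame AVSR and ASD visual embeddings along the
    feature axis, yielding one conditioning vector per video frame.
    These fused vectors would then condition the time-domain
    enhancement network (not shown)."""
    # Both streams must be aligned to the same number of video frames.
    assert avsr_emb.shape[0] == asd_emb.shape[0]
    return np.concatenate([avsr_emb, asd_emb], axis=-1)

# Hypothetical shapes: 25 video frames, 512-d AVSR and 128-d ASD features.
avsr = np.random.randn(25, 512)
asd = np.random.randn(25, 128)
fused = fuse_visual_embeddings(avsr, asd)
print(fused.shape)  # (25, 640)
```

The paper's ablation suggests this joint embedding helps most in low-SNR, multi-speaker conditions, while AVSR embeddings alone suffice when the interference is noise only.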
Problem

Research questions and friction points this paper is trying to address.

Enhancing speech in noisy multi-speaker environments
Utilizing pre-trained visual embeddings for audio-visual speech enhancement
Developing real-time AVSE system for CPU-based operation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-trained AVSR and ASD visual embeddings
Real-time streaming on computer CPU
Open-source AVSE system implementation
Teng Ma
School of Music, Georgia Institute of Technology, United States
Sile Yin
Research, Bose Corporation, United States
Li-Chia Yang
Bose Corporation
Deep Learning · Music Information Retrieval · Speech Enhancement
Shuo Zhang
Research, Bose Corporation, United States