Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of real-time target speech enhancement in multi-speaker scenarios under low signal-to-noise ratio (SNR) conditions, this paper proposes RAVEN, an end-to-end real-time audio-visual speech enhancement system. RAVEN provides the first open-source, CPU-only streaming implementation, fusing visual embeddings extracted from pre-trained audio-visual speech recognition (AVSR) and active speaker detection (ASD) models to guide a time-domain neural network that enhances the target speaker's speech. Unlike unimodal approaches, the joint AVSR–ASD embedding strategy significantly improves target speech intelligibility and quality under severe multi-talker interference and strong background noise. The system operates end-to-end in real time with low latency and is fully open-sourced, including code and demonstration videos.

📝 Abstract
Speech enhancement in audio-only settings remains challenging, particularly in the presence of interfering speakers. This paper presents a simple yet effective real-time audio-visual speech enhancement (AVSE) system, RAVEN, which isolates and enhances the on-screen target speaker while suppressing interfering speakers and background noise. We investigate how visual embeddings learned from audio-visual speech recognition (AVSR) and active speaker detection (ASD) contribute to AVSE across different SNR conditions and numbers of interfering speakers. Our results show concatenating embeddings from AVSR and ASD models provides the greatest improvement in low-SNR, multi-speaker environments, while AVSR embeddings alone perform best in noise-only scenarios. In addition, we develop a real-time streaming system that operates on a computer CPU and we provide a video demonstration and code repository. To our knowledge, this is the first open-source implementation of a real-time AVSE system.
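The fusion strategy described in the abstract is plain concatenation of the per-frame AVSR and ASD visual embeddings into a single conditioning vector. A minimal sketch of that step, with hypothetical embedding dimensions (the actual sizes are not stated here):

```python
import numpy as np

def fuse_visual_embeddings(avsr_emb, asd_emb):
    """Concatenate per-frame AVSR and ASD visual embeddings along the
    feature axis, yielding one conditioning vector per video frame.
    These fused vectors would then condition the time-domain
    enhancement network (not shown)."""
    # Both streams must be aligned to the same number of video frames.
    assert avsr_emb.shape[0] == asd_emb.shape[0]
    return np.concatenate([avsr_emb, asd_emb], axis=-1)

# Hypothetical shapes: 25 video frames, 512-d AVSR and 128-d ASD features.
avsr = np.random.randn(25, 512)
asd = np.random.randn(25, 128)
fused = fuse_visual_embeddings(avsr, asd)
print(fused.shape)  # (25, 640)
```

The paper's ablation suggests this joint embedding helps most in low-SNR, multi-speaker conditions, while AVSR embeddings alone suffice when the interference is noise only.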
Problem

Research questions and friction points this paper is trying to address.

Enhancing speech in noisy multi-speaker environments
Utilizing pre-trained visual embeddings for audio-visual speech enhancement
Developing real-time AVSE system for CPU-based operation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-trained AVSR and ASD visual embeddings
Real-time streaming on computer CPU
Open-source AVSE system implementation
Teng Ma
School of Music, Georgia Institute of Technology, United States
Sile Yin
Research, Bose Corporation, United States
Li-Chia Yang
Bose Corporation
Deep Learning · Music Information Retrieval · Speech Enhancement
Shuo Zhang
Research, Bose Corporation, United States