LASER: Lip Landmark Assisted Speaker Detection for Robustness

📅 2025-01-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the degraded robustness of active speaker detection (ASD) in multi-speaker videos caused by audio-visual asynchrony, this paper proposes an explicit lip-motion modeling approach. Methodologically, it introduces 2D lip keypoints into the ASD training framework for the first time, integrating a lightweight lip detector and a coordinate-to-dense-feature-map encoding mechanism. A dual-branch audio-visual feature alignment module is designed, coupled with a novel lip-face consistency loss to ensure model stability even under lip detection failure. Evaluated on standard benchmarks including AVSpeech and VoxCeleb2, the method achieves significant improvements over state-of-the-art approaches. Notably, it demonstrates superior accuracy and robustness under challenging conditions—such as severe audio-visual asynchrony, low-resolution inputs, and facial occlusions—highlighting its effectiveness in real-world multi-speaker scenarios.

Technology Category

Application Category

📝 Abstract
Active Speaker Detection (ASD) aims to identify speaking individuals in complex visual scenes. While humans can easily detect speech by matching lip movements to audio, current ASD models struggle to establish this correspondence, often misclassifying non-speaking instances when audio and lip movements are unsynchronized. To address this limitation, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER). Unlike models that rely solely on facial frames, LASER explicitly focuses on lip movements by integrating lip landmarks in training. Specifically, given a face track, LASER extracts frame-level visual features and the 2D coordinates of lip landmarks using a lightweight detector. These coordinates are encoded into dense feature maps, providing spatial and structural information on lip positions. Recognizing that landmark detectors may sometimes fail under challenging conditions (e.g., low resolution, occlusions, extreme angles), we incorporate an auxiliary consistency loss to align predictions from both lip-aware and face-only features, ensuring reliable performance even when lip data is absent. Extensive experiments across multiple datasets show that LASER outperforms state-of-the-art models, especially in scenarios with desynchronized audio and visuals, demonstrating robust performance in real-world video contexts. Code is available at url{https://github.com/plnguyen2908/LASER_ASD}.
Problem

Research questions and friction points this paper is trying to address.

Activity Speaker Detection
Complex Scenes
Lip-Sound Mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lip Landmark Assisted
Speaker Detection Robustness
Activity Speaker Detection (ASD) Enhancement
🔎 Similar Papers
No similar papers found.