CueNet: Robust Audio-Visual Speaker Extraction through Cross-Modal Cue Mining and Interaction

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work tackles the performance degradation of audio-visual speaker extraction under impaired visual inputs by proposing a robust method that requires no degraded video in its training data. Inspired by human perceptual mechanisms, it introduces the first multi-cue disentanglement and interaction framework for this task. The approach explicitly models three types of cross-modal cues (speaker identity, acoustic synchronization, and semantic synchronization) from the audio and visual streams, and integrates them through dedicated interaction modules within an end-to-end fusion network. Experiments show that the method significantly outperforms existing approaches across a range of visual degradation scenarios, improving robustness without ever relying on degraded video during training.
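The summary describes the architecture only at a high level. As a rough, non-authoritative sketch (assuming PyTorch and a shared time resolution for the two streams; the module and head names such as `CueDisentangler` are hypothetical, not from the paper), the disentangle-then-interact idea might look like:

```python
# Minimal sketch of the three-cue idea described above (hypothetical names,
# not the authors' code): separate projection heads extract identity,
# acoustic-sync, and semantic-sync cues from audio/visual embeddings, then an
# attention-based interaction step fuses them into one guidance signal.
import torch
import torch.nn as nn

class CueDisentangler(nn.Module):
    def __init__(self, audio_dim=256, visual_dim=512, cue_dim=256):
        super().__init__()
        # One projection head per cue type; identity is read from the visual
        # stream alone, while the two synchronisation cues compare audio and
        # video jointly.
        self.identity_head = nn.Linear(visual_dim, cue_dim)
        self.acoustic_head = nn.Linear(audio_dim + visual_dim, cue_dim)
        self.semantic_head = nn.Linear(audio_dim + visual_dim, cue_dim)
        # Cue interaction: self-attention over the three cue tokens, so each
        # cue can compensate when another is unreliable (e.g. occluded lips).
        self.interact = nn.MultiheadAttention(cue_dim, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(cue_dim * 3, cue_dim)

    def forward(self, audio_emb, visual_emb):
        # audio_emb: (B, T, audio_dim); visual_emb: (B, T, visual_dim),
        # assumed already aligned to the same frame rate T.
        av = torch.cat([audio_emb, visual_emb], dim=-1)
        cues = torch.stack([
            self.identity_head(visual_emb),
            self.acoustic_head(av),
            self.semantic_head(av),
        ], dim=2)                                   # (B, T, 3, cue_dim)
        b, t, n, d = cues.shape
        cues = cues.view(b * t, n, d)
        mixed, _ = self.interact(cues, cues, cues)  # cues attend to each other
        mixed = mixed.view(b, t, n * d)
        return self.fuse(mixed)                     # (B, T, cue_dim) guidance
```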

📝 Abstract
Audio-visual speaker extraction has attracted increasing attention, as it removes the need for pre-registered speech and leverages the visual modality as a complement to audio. Although existing methods have achieved impressive performance, the issue of degraded visual inputs has received relatively little attention, despite being common in real-world scenarios. Previous attempts to address this problem have mainly involved training with degraded visual data. However, visual degradation can occur in many unpredictable ways, making it impractical to simulate all possible cases during training. In this paper, we aim to enhance the robustness of audio-visual speaker extraction against impaired visual inputs without relying on degraded videos during training. Inspired by observations from human perceptual mechanisms, we propose an audio-visual learner that disentangles speaker information, acoustic synchronisation, and semantic synchronisation as distinct cues. Furthermore, we design a dedicated interaction module that effectively integrates these cues to provide a reliable guidance signal for speaker extraction. Extensive experiments demonstrate the strong robustness of the proposed model under various visual degradations and its clear superiority over existing methods.
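Because the model never sees degraded video during training, robustness is presumably probed by corrupting the visual stream only at test time. A minimal illustrative sketch of such test-time corruption follows; the modes and severities below are assumptions for illustration, not the paper's evaluation protocol.

```python
# Hedged sketch: simulate unseen visual degradations at inference time only.
import torch

def degrade_frames(frames, mode="occlusion", severity=0.5):
    """frames: (T, C, H, W) float tensor with values in [0, 1]."""
    out = frames.clone()
    t, c, h, w = out.shape
    if mode == "occlusion":
        # Black out a centred square patch covering `severity` of each side.
        ph, pw = int(h * severity), int(w * severity)
        y0, x0 = (h - ph) // 2, (w - pw) // 2
        out[:, :, y0:y0 + ph, x0:x0 + pw] = 0.0
    elif mode == "noise":
        # Additive Gaussian noise, clamped back to the valid range.
        out = (out + severity * torch.randn_like(out)).clamp(0.0, 1.0)
    elif mode == "frame_drop":
        # Zero out a random subset of frames to mimic missing video.
        drop = torch.rand(t) < severity
        out[drop] = 0.0
    return out
```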
Problem

Research questions and friction points this paper is trying to address.

audio-visual speaker extraction
visual degradation
robustness
cross-modal cue
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal cue mining
audio-visual speaker extraction
robustness to visual degradation
cue disentanglement
multimodal interaction