🤖 AI Summary
This work addresses modality misalignment in audio-visual learning, often caused by off-screen sound sources and background interference, which leads to unstable training and degraded representations. To this end, the authors propose the CAE-AV framework, which combines caption-guided semantic alignment with cross-modal agreement. Specifically, the CASTE module evaluates frame-level audio-visual agreement to balance spatial and temporal relations and focus on salient frames, while the CASE module injects caption-aligned semantic guidance to strengthen cross-modal correspondence. The framework is further trained with lightweight objectives: caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization. With frozen backbones, CAE-AV achieves state-of-the-art performance on four benchmarks (AVE, AVVP, AVS, and AVQA), and qualitative results confirm its robustness to modality misalignment.
📝 Abstract
Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods often amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which uses two complementary modules, Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE), to mitigate audio-visual misalignment. CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design lightweight objectives, caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization, to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on the AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
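To make the three lightweight objectives concrete, below is a minimal PyTorch sketch. The `frame_agreement` score, the pooling scheme, the temperature, the loss weights `lam_cons`/`lam_ent`, and the sign of the entropy term are all illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def info_nce(anchor, positive, temperature=0.07):
    """Symmetric InfoNCE over a batch: anchor[i] should match positive[i]."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def frame_agreement(vis_frames, aud_frames):
    """Frame-level audio-visual agreement as cosine similarity, softmax-normalized
    over time so it can act as a selection weight (an assumed realization of the
    agreement score that guides CASTE)."""
    sim = F.cosine_similarity(vis_frames, aud_frames, dim=-1)  # (B, T)
    return sim.softmax(dim=-1)


def cae_av_losses(cap_emb, vis_frames, aud_frames, lam_cons=1.0, lam_ent=0.1):
    """cap_emb: (B, D) caption embedding; vis_frames/aud_frames: (B, T, D)."""
    weights = frame_agreement(vis_frames, aud_frames)          # (B, T)
    vis_emb = (weights.unsqueeze(-1) * vis_frames).sum(dim=1)  # (B, D)
    aud_emb = (weights.unsqueeze(-1) * aud_frames).sum(dim=1)  # (B, D)

    # Caption-to-modality InfoNCE: pull each modality toward its caption.
    l_cap = info_nce(cap_emb, vis_emb) + info_nce(cap_emb, aud_emb)

    # Visual-audio consistency: agreement between the pooled modality embeddings.
    l_cons = (1.0 - F.cosine_similarity(vis_emb, aud_emb, dim=-1)).mean()

    # Entropy regularization over the selection weights. Adding the entropy to
    # the loss penalizes flat distributions, concentrating selection on a few
    # salient frames; the sign convention here is an assumption.
    l_ent = -(weights * (weights + 1e-8).log()).sum(dim=-1).mean()

    return l_cap + lam_cons * l_cons + lam_ent * l_ent
```

The entropy term could equally be subtracted to encourage spread-out selection; which direction helps depends on whether token selection should sharpen onto salient positions or stay diverse, and the abstract alone does not pin this down.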