CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of modality misalignment in audio-visual learning, often caused by off-screen sound sources and background interference, which leads to training instability and degraded representations. To this end, the authors propose the CAE-AV framework, which introduces a novel joint mechanism of caption-guided semantic alignment and cross-modal consistency. Specifically, the CASTE module dynamically models spatio-temporal relationships to focus on salient frames, while the CASE module leverages caption alignment to enhance cross-modal correspondence. The framework is further augmented with lightweight contrastive objectives, including caption-to-modality InfoNCE and entropy regularization. Evaluated with frozen backbones, CAE-AV achieves state-of-the-art performance across four major benchmarks—AVE, AVVP, AVS, and AVQA—demonstrating significantly improved robustness and alignment accuracy, with qualitative results confirming its strong resilience to modality misalignment.
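The summary above says CASTE weights frames by their audio-visual agreement so salient, well-aligned moments dominate. The page does not spell out the exact computation, so the following is a minimal numpy sketch of that general idea; the function names, the softmax temperature, and the fixed 50/50 blend ratio are all assumptions, not the paper's actual formulation:

```python
import numpy as np

def agreement_weights(audio_feats, visual_feats, temperature=1.0):
    """Frame-level audio-visual agreement: cosine similarity between each
    frame's audio and visual embeddings, softmax-normalized over time."""
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    agreement = np.sum(a * v, axis=1)          # (T,) per-frame cosine similarity
    z = agreement / temperature
    z -= z.max()                               # numerical stability before exp
    return np.exp(z) / np.exp(z).sum()         # (T,) weights summing to 1

def enrich(visual_feats, weights):
    """Blend each frame with an agreement-weighted temporal context, so a
    misaligned frame borrows evidence from better-aligned neighbors."""
    context = weights @ visual_feats           # (D,) weighted temporal pooling
    return 0.5 * visual_feats + 0.5 * context  # assumed fixed blend ratio
```

Under this sketch, a frame whose audio and visual features disagree (e.g. an off-screen sound) receives a lower weight than its aligned neighbors, which is the intended robustness behavior.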

📝 Abstract
Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods often amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which uses two complementary modules, Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE), to relieve audio-visual misalignment. CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design lightweight objectives (caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization) to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on the AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
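The abstract names a caption-to-modality InfoNCE objective and an entropy regularizer among the lightweight losses. Their exact formulations are not given on this page, so the following is a generic sketch of both, assuming standard in-batch InfoNCE with diagonal positives and a softmax entropy over token-selection logits; the function names, temperature, and epsilon are placeholders:

```python
import numpy as np

def info_nce(captions, modality, temperature=0.07):
    """Caption-to-modality InfoNCE: caption i should match modality
    (audio or visual) embedding i against in-batch negatives."""
    # L2-normalize so dot products become cosine similarities
    c = captions / np.linalg.norm(captions, axis=1, keepdims=True)
    m = modality / np.linalg.norm(modality, axis=1, keepdims=True)
    logits = c @ m.T / temperature                   # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    idx = np.arange(len(c))                          # positives on the diagonal
    return -np.mean(np.log(probs[idx, idx]))

def entropy_regularizer(selection_logits):
    """Mean entropy of soft token-selection weights; encouraging higher
    entropy discourages the selector from collapsing onto a few tokens."""
    z = selection_logits - selection_logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -np.mean(np.sum(p * np.log(p + 1e-8), axis=-1))
```

As a sanity check, identical caption and modality embeddings yield a lower InfoNCE loss than mismatched pairings, and uniform selection logits give the maximum entropy log(K) over K tokens.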
Problem

Research questions and friction points this paper is trying to address.

audio-visual learning
modality misalignment
off-screen sources
background clutter
cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal alignment
audio-visual learning
spatio-temporal enrichment
semantic guidance
modality misalignment
Yunzuo Hu
School of Information Science and Engineering, East China University of Science and Technology (ECUST), Shanghai 200237, P. R. China
Wen Li
School of Information Science and Engineering, East China University of Science and Technology (ECUST), Shanghai 200237, P. R. China
Jing Zhang
East China University of Science and Technology
computer vision, image understanding