CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

📅 2025-05-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual representation learning methods rely on global audio representations, which limits fine-grained frame-level temporal alignment and creates optimization conflicts between contrastive learning and reconstruction objectives. This paper proposes a self-supervised framework that addresses these issues in three ways. First, it models audio as a temporal sequence aligned with individual video frames rather than as a single global representation. Second, it decouples global contrastive learning from masked reconstruction through dedicated global tokens, eliminating multi-task interference. Third, it introduces learnable register tokens that improve spatial localization. The method jointly integrates multimodal sequence modeling, contrastive learning, and masked autoencoding. Evaluated on AudioSet, VGGSound, and ADE20K Sound, it achieves state-of-the-art performance on zero-shot retrieval, classification, and sound localization, outperforming more complex architectures despite its conceptual simplicity and efficiency.
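The frame-level alignment idea in the summary can be illustrated with a minimal numpy sketch: each audio frame embedding is contrasted against the temporally matching video frame embedding via a symmetric InfoNCE loss, with all other frames in the clip serving as negatives. The function names, the toy dimensions, and the exact loss formulation here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def log_sum_exp(x, axis):
    # Numerically stable log-sum-exp for the softmax denominator.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def frame_level_infonce(audio_seq, video_seq, temperature=0.07):
    """Symmetric InfoNCE over temporally matching audio/video frame pairs.

    audio_seq, video_seq: (T, D) arrays of per-frame embeddings. Frame t of
    the audio is the positive for frame t of the video; every other frame in
    the sequence acts as a negative.
    """
    a = l2_normalize(audio_seq)
    v = l2_normalize(video_seq)
    logits = a @ v.T / temperature                  # (T, T) similarity matrix
    idx = np.arange(len(logits))                    # diagonal = positive pairs
    log_prob_av = logits - log_sum_exp(logits, axis=1)      # audio -> video
    log_prob_va = logits.T - log_sum_exp(logits.T, axis=1)  # video -> audio
    return -(log_prob_av[idx, idx].mean() + log_prob_va[idx, idx].mean()) / 2
```

Perfectly aligned sequences yield a lower loss than temporally shuffled ones, which is what drives the model toward fine-grained correspondence.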

📝 Abstract
Recent advances in audio-visual learning have shown promising results in learning representations across modalities. However, most approaches rely on global audio representations that fail to capture fine-grained temporal correspondences with visual frames. Additionally, existing methods often struggle with conflicting optimization objectives when trying to jointly learn reconstruction and cross-modal alignment. In this work, we propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We address three key challenges: First, we tackle the granularity mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations. Second, we resolve conflicting optimization goals by separating contrastive and reconstruction objectives through dedicated global tokens. Third, we improve spatial localization by introducing learnable register tokens that reduce semantic load on patch tokens. We evaluate the proposed approach on AudioSet, VGGSound, and the ADE20K Sound dataset on zero-shot retrieval, classification, and localization tasks, demonstrating state-of-the-art performance and outperforming more complex architectures.
Problem

Research questions and friction points this paper is trying to address.

Address granularity mismatch between audio and visual modalities
Resolve conflicting optimization goals in cross-modal learning
Improve spatial localization with learnable register tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns audio as a temporal sequence with individual video frames
Separates contrastive and reconstruction objectives via dedicated global tokens
Introduces learnable register tokens for improved spatial localization
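The token routing behind the second and third points can be sketched as follows: a global token is prepended for the contrastive objective, learnable register tokens absorb global context, and only the patch tokens feed the masked-reconstruction decoder. The dimensions and the identity stand-in for the transformer encoder are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Hypothetical sizes: D-dim tokens, P patch tokens, R register tokens.
D, P, R = 16, 6, 2
rng = np.random.default_rng(0)

# Learnable parameters (randomly initialized here for illustration).
global_token = rng.normal(size=(1, D))  # consumed only by the contrastive loss
registers = rng.normal(size=(R, D))     # soak up global context, never decoded

def encode(patch_tokens, encoder):
    """Prepend global and register tokens, encode, then route the outputs.

    `encoder` stands in for a transformer encoder (identity here); the point
    is the token routing, not the attention computation.
    """
    x = np.concatenate([global_token, registers, patch_tokens], axis=0)
    y = encoder(x)
    contrastive_feat = y[0]               # -> contrastive objective only
    patch_feats = y[1 + R:]               # -> masked reconstruction only
    return contrastive_feat, patch_feats  # register outputs are discarded

patches = rng.normal(size=(P, D))
g, p = encode(patches, encoder=lambda x: x)
assert g.shape == (D,) and p.shape == (P, D)
```

Because the contrastive and reconstruction losses each see a disjoint slice of the encoder output, neither objective pulls directly on the other's tokens, which is the interference-reduction argument made above.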