GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual joint embedding methods struggle to effectively model dense cross-modal correspondences across multiple scales, limiting both understanding and generation performance. This work proposes the first unified framework that integrates multi-scale contrastive learning with diffusion-based generative objectives. By leveraging a spatial-temporal diffusion model, the approach jointly optimizes semantic alignment, temporal consistency, and cross-modal synthesis capabilities across varying granularities of audio-visual data. Evaluated on VGGSound, AudioSet, and Panda70M, the method significantly outperforms current state-of-the-art approaches in both audio-visual retrieval and generation tasks, achieving a deep integration of discriminative and generative cross-modal modeling.

📝 Abstract
Recent advances in video-audio (V-A) understanding and generation have increasingly relied on joint V-A embeddings, which serve as the foundation for tasks such as cross-modal retrieval and generation. While prior methods like CAVP effectively model semantic and temporal correspondences between modalities using contrastive objectives, their performance remains suboptimal. A key limitation is the insufficient modeling of the dense, multi-scale nature of both video and audio signals: correspondences often span fine- to coarse-grained spatial-temporal structures that are underutilized in existing frameworks. To this end, we propose GMS-CAVP, a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives to enhance V-A correspondence modeling. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, we go beyond traditional contrastive learning by incorporating a diffusion-based generative objective, enabling modality translation and synthesis between video and audio. This unified discriminative-generative formulation facilitates deeper cross-modal understanding and paves the way for high-fidelity generation. Extensive experiments on VGGSound, AudioSet, and Panda70M demonstrate that GMS-CAVP outperforms previous methods in both generation and retrieval.
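The two pretraining objectives described above can be sketched in a minimal NumPy form. This is one plausible reading, not the paper's actual architecture: the choice of temporal pooling scales, the symmetric InfoNCE formulation, and the DDPM-style noise-prediction loss are all assumptions made for illustration.

```python
import numpy as np

def pool_scales(feats, scales=(1, 2, 4)):
    """Average-pool a (T, D) feature sequence into `s` segments per scale,
    giving coarse-to-fine temporal views (a hypothetical multi-scale scheme)."""
    outs = []
    T = feats.shape[0]
    for s in scales:
        seg = T // s
        outs.append(feats[: seg * s].reshape(s, seg, -1).mean(axis=1))
    return outs

def info_nce(v, a, temperature=0.07):
    """Symmetric InfoNCE between matched rows of video (N, D) and audio (N, D)."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    logits = v @ a.T / temperature

    def ce(l):
        # cross-entropy with positives on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (ce(logits) + ce(logits.T))

def multiscale_contrastive_loss(video_feats, audio_feats, scales=(1, 2, 4)):
    """Average InfoNCE across temporal granularities: aligned segments of the
    same clip are positives, other segments at that scale are negatives."""
    pairs = zip(pool_scales(video_feats, scales), pool_scales(audio_feats, scales))
    return sum(info_nce(v, a) for v, a in pairs) / len(scales)

def diffusion_denoising_loss(x0, cond, predict_noise, t, alphas_bar):
    """DDPM-style generative objective: corrupt audio latents x0 at timestep t
    and regress the added noise, conditioned on video features `cond`."""
    noise = np.random.default_rng(0).standard_normal(x0.shape)
    ab = alphas_bar[t]
    xt = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * noise
    return np.mean((predict_noise(xt, cond, t) - noise) ** 2)
```

In a real system `predict_noise` would be a learned spatial-temporal denoiser and the two losses would be summed with a weighting hyperparameter; here they are kept separate to show each objective in isolation.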
Problem

Research questions and friction points this paper is trying to address.

video-audio correspondence
multi-scale modeling
cross-modal alignment
dense correspondence
spatial-temporal structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-scale contrastive learning
diffusion-based generative pretraining
audio-video correspondence
cross-modal generation
joint embedding