🤖 AI Summary
Existing cross-modal alignment methods such as CLAP and CAVP rely on a single contrastive loss, neglecting bidirectional modality interaction and the inherent noise in each modality, which limits semantic consistency and robustness. To address this, we propose DiffGAP, a framework that integrates a lightweight bidirectional conditional diffusion module directly into the contrastive embedding space. DiffGAP performs mutual conditional denoising between text/video and audio embeddings (denoising text and video embeddings conditioned on audio, and vice versa), explicitly modeling cross-modal noise and interaction without reconstructing raw signals, thus balancing efficiency and semantic fidelity. Leveraging feature transfer from pretrained CLAP/CAVP backbones, DiffGAP achieves an average 9.2% improvement in R@1 for audio-video-text retrieval and generation on VGGSound and AudioCaps. Moreover, it operates at 3.8× the inference speed of generative baselines while maintaining competitive alignment quality.
📝 Abstract
Recent works in cross-modal understanding and generation, notably through models like CLAP (Contrastive Language-Audio Pretraining) and CAVP (Contrastive Audio-Visual Pretraining), have significantly enhanced the alignment of text, video, and audio embeddings via a single contrastive loss. However, these methods often overlook the bidirectional interactions and the inherent noise present in each modality, which can crucially impact the quality and efficacy of cross-modal integration. To address this limitation, we introduce DiffGAP, a novel approach incorporating a lightweight generative module within the contrastive space. Specifically, DiffGAP employs a bidirectional diffusion process tailored to bridge the cross-modal gap more effectively. This involves a denoising process on text and video embeddings conditioned on audio embeddings, and vice versa, thus facilitating a more nuanced and robust cross-modal interaction. Our experimental results on the VGGSound and AudioCaps datasets demonstrate that DiffGAP significantly improves performance in video/text-audio generation and retrieval tasks, confirming its effectiveness in enhancing cross-modal understanding and generation capabilities.
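The core idea, mutual conditional denoising between paired embeddings, can be illustrated with a toy sketch. Note that the abstract does not specify the denoiser architecture, noise schedule, or update rule, so everything below (the linear noise predictor `predict_noise`, the `beta` schedule, the dimensions) is a hypothetical stand-in for the paper's lightweight diffusion module, not its actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dimension (assumption; real CLAP/CAVP embeddings are larger)

# Hypothetical linear "denoiser": predicts the noise component from the noisy
# embedding, the conditioning embedding of the other modality, and a timestep.
W = rng.normal(scale=0.1, size=(2 * D + 1, D))

def predict_noise(z_noisy, cond, t):
    """Toy stand-in for the lightweight conditional noise predictor."""
    x = np.concatenate([z_noisy, cond, [t]])
    return x @ W

def denoise(z, cond, steps=10, beta=0.05):
    """One direction of the bidirectional process: refine embedding z
    conditioned on the other modality's embedding (simplified reverse step)."""
    for t in range(steps, 0, -1):
        eps = predict_noise(z, cond, t / steps)
        z = (z - beta * eps) / np.sqrt(1.0 - beta)
    return z

# Bidirectional use: audio denoised conditioned on text, and vice versa.
audio = rng.normal(size=D)
text = rng.normal(size=D)
audio_refined = denoise(audio, cond=text)
text_refined = denoise(text, cond=audio)
```

The key point the sketch captures is that the diffusion operates on embeddings rather than raw audio or video signals, which is what keeps the module lightweight relative to full generative baselines.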