DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap

📅 2025-03-15
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing cross-modal alignment methods such as CLAP and CAVP rely on a single contrastive loss, neglecting bidirectional modality interaction and the inherent noise in each modality, which limits semantic consistency and robustness. To address this, the paper proposes DiffGAP, a lightweight bidirectional conditional diffusion module integrated directly into the contrastive embedding space. DiffGAP performs mutual conditional denoising between text/video and audio embeddings, explicitly modeling cross-modal noise and interaction without reconstructing raw signals, balancing efficiency and semantic fidelity. Leveraging feature transfer from pretrained CLAP/CAVP backbones, DiffGAP achieves an average 9.2% improvement in R@1 for audio-video-text retrieval and generation on VGGSound and AudioCaps, and runs at 3.8× the inference speed of generative baselines while maintaining competitive alignment quality.

📝 Abstract
Recent works in cross-modal understanding and generation, notably through models like CLAP (Contrastive Language-Audio Pretraining) and CAVP (Contrastive Audio-Visual Pretraining), have significantly enhanced the alignment of text, video, and audio embeddings via a single contrastive loss. However, these methods often overlook the bidirectional interactions and inherent noises present in each modality, which can crucially impact the quality and efficacy of cross-modal integration. To address this limitation, we introduce DiffGAP, a novel approach incorporating a lightweight generative module within the contrastive space. Specifically, our DiffGAP employs a bidirectional diffusion process tailored to bridge the cross-modal gap more effectively. This involves a denoising process on text and video embeddings conditioned on audio embeddings and vice versa, thus facilitating a more nuanced and robust cross-modal interaction. Our experimental results on VGGSound and AudioCaps datasets demonstrate that DiffGAP significantly improves performance in video/text-audio generation and retrieval tasks, confirming its effectiveness in enhancing cross-modal understanding and generation capabilities.
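The abstract describes denoising one modality's embedding conditioned on another's, carried out in the contrastive embedding space rather than on raw signals. The sketch below illustrates that idea with a standard epsilon-prediction diffusion setup on small vectors; the cosine schedule, the two-layer MLP denoiser, and all dimensions are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_alpha_bar(T=50):
    # Cumulative cosine noise schedule; alpha_bar[0] = 1, decaying toward 0.
    t = np.arange(T + 1) / T
    f = np.cos((t + 0.008) / 1.008 * np.pi / 2) ** 2
    return f / f[0]

def q_sample(x0, t, alpha_bar):
    # Forward diffusion: noise a clean embedding x0 to timestep t.
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

class CondDenoiser:
    """Toy conditional denoiser (hypothetical 2-layer MLP): predicts the
    noise on a text/video embedding given the paired audio embedding."""
    def __init__(self, dim, hidden=64):
        self.W1 = rng.standard_normal((2 * dim + 1, hidden)) * 0.1
        self.W2 = rng.standard_normal((hidden, dim)) * 0.1
    def predict_noise(self, x_t, cond, t_frac):
        h = np.concatenate([x_t, cond, [t_frac]])  # noisy emb + condition + time
        return np.maximum(h @ self.W1, 0) @ self.W2

dim, T = 8, 50
alpha_bar = cosine_alpha_bar(T)
text_emb = rng.standard_normal(dim)   # stand-in for a frozen CLAP text embedding
audio_emb = rng.standard_normal(dim)  # stand-in for the paired audio embedding

t = 25
x_t, true_eps = q_sample(text_emb, t, alpha_bar)
denoiser = CondDenoiser(dim)
pred = denoiser.predict_noise(x_t, audio_emb, t / T)
loss = np.mean((pred - true_eps) ** 2)  # usual epsilon-prediction objective
print(x_t.shape, float(loss))
```

Because the diffusion operates on compact embeddings instead of waveforms or frames, the denoiser can stay small, which is what makes the module "lightweight" relative to generative baselines that reconstruct raw signals.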
Problem

Research questions and friction points this paper is trying to address.

Addresses bidirectional interactions in cross-modal understanding.
Reduces inherent noise in text, video, and audio embeddings.
Enhances cross-modal integration quality and efficacy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight generative module in contrastive space
Bidirectional diffusion process for cross-modal gap
Denoising text/video embeddings conditioned on audio, and vice versa
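The bidirectional objective these bullets describe can be sketched as two symmetric epsilon-prediction losses, each modality denoised conditioned on the other; the toy denoiser and the linear schedule below are placeholders under stated assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, T = 8, 100

def noised(x0, t):
    # Linear alpha_bar schedule for brevity; real models use cosine/beta schedules.
    alpha_bar = 1.0 - t / T
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

def toy_denoiser(x_t, cond):
    # Placeholder network: a random linear map over [noisy emb, condition].
    W = rng.standard_normal((2 * dim, dim)) * 0.1
    return np.concatenate([x_t, cond]) @ W

video_emb = rng.standard_normal(dim)  # stand-in CAVP video embedding
audio_emb = rng.standard_normal(dim)  # stand-in paired audio embedding

t = 60
v_t, eps_v = noised(video_emb, t)
a_t, eps_a = noised(audio_emb, t)

# Bidirectional objective: denoise video given audio, and audio given video.
loss_v = np.mean((toy_denoiser(v_t, audio_emb) - eps_v) ** 2)
loss_a = np.mean((toy_denoiser(a_t, video_emb) - eps_a) ** 2)
loss = loss_v + loss_a
print(float(loss))
```

Summing the two directions is what distinguishes this from the one-way conditioning in standard contrastive pipelines: both modalities learn to absorb information from, and correct noise with respect to, the other.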
Shentong Mo
Department of CST, Tsinghua University, Beijing 100084, China; Department of Machine Learning, Carnegie Mellon University, Pittsburgh 15213, USA; Department of Machine Learning, MBZUAI, Abu Dhabi, UAE; Shengshu AI
Zehua Chen
PostDoc at Tsinghua University | Ph.D. from Imperial College
Generative Models · Multi-modal Generation · Health Monitoring
Fan Bao
ShengShu
Machine learning
Jun Zhu
Department of CST, Tsinghua University, Beijing 100084, China; Shengshu AI