🤖 AI Summary
Existing cross-modal alignment methods such as CLAP and CAVP rely on a single contrastive loss, neglecting bidirectional modality interaction and the inherent noise in each modality, which limits semantic consistency and robustness. To address this, we propose DiffGAP, a framework that integrates a lightweight bidirectional conditional diffusion module directly into the contrastive embedding space. DiffGAP performs mutual conditional denoising between text/video and audio embeddings (denoising text and video embeddings conditioned on audio, and vice versa), explicitly modeling cross-modal noise and interaction without reconstructing raw signals, thus balancing efficiency and semantic fidelity. Leveraging feature transfer from pretrained CLAP/CAVP backbones, DiffGAP achieves an average 9.2% improvement in R@1 for audio-video-text retrieval and generation on VGGSound and AudioCaps. Moreover, it operates at 3.8× the inference speed of generative baselines while maintaining competitive alignment quality.
📝 Abstract
Recent works in cross-modal understanding and generation, notably through models like CLAP (Contrastive Language-Audio Pretraining) and CAVP (Contrastive Audio-Visual Pretraining), have significantly enhanced the alignment of text, video, and audio embeddings via a single contrastive loss. However, these methods often overlook the bidirectional interactions and the inherent noise present in each modality, which can crucially impact the quality and efficacy of cross-modal integration. To address this limitation, we introduce DiffGAP, a novel approach incorporating a lightweight generative module within the contrastive space. Specifically, DiffGAP employs a bidirectional diffusion process tailored to bridge the cross-modal gap more effectively. This involves a denoising process on text and video embeddings conditioned on audio embeddings, and vice versa, thus facilitating a more nuanced and robust cross-modal interaction. Our experimental results on the VGGSound and AudioCaps datasets demonstrate that DiffGAP significantly improves performance in video/text-audio generation and retrieval tasks, confirming its effectiveness in enhancing cross-modal understanding and generation capabilities.
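The core idea, mutual conditional denoising between paired embeddings, can be illustrated with a toy sketch. Note that the abstract does not specify the denoiser architecture, noise schedule, or update rule, so everything below (the linear noise predictor `predict_noise`, the `beta` schedule, the dimensions) is a hypothetical stand-in for the paper's lightweight diffusion module, not its actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dimension (assumption; real CLAP/CAVP embeddings are larger)

# Hypothetical linear "denoiser": predicts the noise component from the noisy
# embedding, the conditioning embedding of the other modality, and a timestep.
W = rng.normal(scale=0.1, size=(2 * D + 1, D))

def predict_noise(z_noisy, cond, t):
    """Toy stand-in for the lightweight conditional noise predictor."""
    x = np.concatenate([z_noisy, cond, [t]])
    return x @ W

def denoise(z, cond, steps=10, beta=0.05):
    """One direction of the bidirectional process: refine embedding z
    conditioned on the other modality's embedding (simplified reverse step)."""
    for t in range(steps, 0, -1):
        eps = predict_noise(z, cond, t / steps)
        z = (z - beta * eps) / np.sqrt(1.0 - beta)
    return z

# Bidirectional use: audio denoised conditioned on text, and vice versa.
audio = rng.normal(size=D)
text = rng.normal(size=D)
audio_refined = denoise(audio, cond=text)
text_refined = denoise(text, cond=audio)
```

The key point the sketch captures is that the diffusion operates on embeddings rather than raw audio or video signals, which is what keeps the module lightweight relative to full generative baselines.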