🤖 AI Summary
Multimodal alignment is often hindered by distributional discrepancies between vision and language modalities, causing methods like CLIP—which rely solely on pairwise mutual information maximization—to neglect global distributional consistency. To address this, we propose CS-Aligner, the first framework to incorporate the Cauchy–Schwarz divergence into multimodal alignment. CS-Aligner jointly optimizes this divergence—capturing holistic inter-modal distributional disparity—and mutual information—modeling fine-grained semantic correspondence—thereby enabling synergistic distribution-level and sample-level alignment. The method supports both unpaired data and token-level alignment, offering enhanced flexibility and precision. Extensive experiments on text-to-image generation and cross-modal retrieval demonstrate that CS-Aligner significantly outperforms CLIP and other baselines, effectively bridging the modality gap and improving alignment compactness and downstream task performance.
📝 Abstract
Multimodal alignment is crucial for various downstream tasks such as cross-modal generation and retrieval. Previous multimodal approaches like CLIP maximize mutual information mainly by aligning paired samples across modalities while overlooking distributional differences, leading to suboptimal alignment with modality gaps. In this paper, to overcome this limitation, we propose CS-Aligner, a novel and straightforward framework that performs distributional vision-language alignment by integrating the Cauchy-Schwarz (CS) divergence with mutual information. Within the proposed framework, we find that the CS divergence and mutual information play complementary roles in multimodal alignment, capturing both the global distribution of each modality and the pairwise semantic relationships, yielding tighter and more precise alignment. Moreover, CS-Aligner can incorporate additional information from unpaired data and token-level representations, enabling more flexible and fine-grained alignment in practice. Experiments on text-to-image generation and cross-modal retrieval tasks demonstrate the effectiveness of our method on vision-language alignment.
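The abstract itself does not spell out the estimator, but the Cauchy-Schwarz divergence between two distributions p and q is defined as D_CS(p, q) = -log( (∫pq)² / (∫p² ∫q²) ), which is zero iff p = q, and is commonly estimated from samples with kernel density estimators. Below is a minimal sketch of such an estimator between two sets of embeddings; this is not the authors' implementation, and the Gaussian kernel choice and fixed bandwidth `sigma` are assumptions for illustration:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise Gaussian kernel matrix between rows of a (m, d) and b (n, d).
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-sq / (2.0 * sigma**2))

def cs_divergence(x, y, sigma=1.0):
    """Empirical Cauchy-Schwarz divergence between samples x and y.

    Plug kernel density estimates of p and q into
    D_CS(p, q) = -log( <p, q>^2 / (<p, p> <q, q>) ).
    By the Cauchy-Schwarz inequality the result is >= 0,
    and it is 0 when the two sample sets coincide.
    """
    kxy = gaussian_kernel(x, y, sigma).mean()  # cross term  <p, q>
    kxx = gaussian_kernel(x, x, sigma).mean()  # self term   <p, p>
    kyy = gaussian_kernel(y, y, sigma).mean()  # self term   <q, q>
    return -2.0 * np.log(kxy) + np.log(kxx) + np.log(kyy)
```

In an alignment setting, `x` and `y` would be batches of image and text embeddings; minimizing this quantity pulls the two embedding distributions together at the population level, complementing the pairwise (sample-level) mutual-information objective. Note that the estimator uses all cross-modal pairs in the batch, which is why it can also exploit unpaired data.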