Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation

📅 2023-05-08
🏛️ Annual Meeting of the Association for Computational Linguistics
📈 Citations: 16
Influential: 0
🤖 AI Summary
Existing vision-language pretraining (VLP) models suffer from low computational efficiency in modeling long visual sequences and suboptimal semantic alignment due to the conventional InfoNCE loss in cross-modal contrastive learning, which erroneously treats semantically similar samples as negatives. To address these issues, we propose Semantic-Aware Contrastive Learning (SACL), a mutual information maximization–inspired framework. SACL introduces, for the first time, a cross-modal similarity modulation mechanism that dynamically adjusts negative-pair contrastive strength based on semantic proximity. It further formulates a theoretically grounded, mutual information–based weighted contrastive loss and integrates it into a lightweight VLP architecture. Extensive experiments on downstream tasks—including VQA, NLVR2, and image/text retrieval—demonstrate consistent and significant performance gains. Results validate that preserving semantically informative “false negatives” enhances both model generalization and cross-modal alignment fidelity.
📝 Abstract
In this paper, we reconsider the problem of (partial) false negative samples from the Mutual Information (MI) maximization perspective. The traditional contrastive loss (such as the InfoNCE loss) pushes the anchor equally far from all negative samples, regardless of their possible semantic similarity. We theoretically show that the InfoNCE loss not only maximizes the MI between the anchor and positive samples but also minimizes the MI between the anchor and false negative samples even though they share similar semantics, which offers a possible theoretical explanation for the observation that false negative samples in cross-modal contrastive learning degrade the downstream-task performance of VLP models. This analysis motivates us to propose a VLP model with a novel Semantic-Aware Contrastive Learning framework, named SACL, in which different negative samples are assigned different contrastive weights according to their semantic similarity to the anchor.
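The weighted contrastive loss the abstract describes can be illustrated with a minimal sketch. This is not the paper's exact formulation: the function name, the `sem_sim` scores, and the simple `1 - sem_sim` weighting rule are illustrative assumptions, whereas SACL derives its weights from the MI analysis.

```python
import numpy as np

def sacl_weighted_infonce(anchor, candidates, pos_idx, sem_sim, tau=0.07):
    """Sketch of a semantic-aware InfoNCE loss: each negative's term in the
    denominator is down-weighted by its semantic similarity to the anchor,
    so a likely false negative (sem_sim near 1) is barely pushed away.
    `sem_sim` is a hypothetical per-candidate score in [0, 1]."""
    # cosine similarities between the anchor and all candidates
    a = anchor / np.linalg.norm(anchor)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = c @ a / tau
    # contrastive weights: the positive keeps weight 1; negatives are
    # scaled down as their semantic similarity to the anchor grows
    w = 1.0 - sem_sim
    w[pos_idx] = 1.0
    # InfoNCE numerator over the semantically weighted denominator
    return -np.log(np.exp(logits[pos_idx]) / (w * np.exp(logits)).sum())
```

With all `sem_sim` set to zero this reduces to the standard InfoNCE loss; raising a negative's score shrinks its contribution to the denominator and therefore the loss.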
Problem

Research questions and friction points this paper is trying to address.

Reduces computational cost in vision-language models
Selects text-relevant image patches dynamically
Maintains performance while accelerating training speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-guided patch selection reduces visual sequence
Dynamic text-dependent attention identifies relevant tokens
No extra parameters added to Vision Transformers
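The text-guided patch selection described above can be sketched as a parameter-free scoring step. This is a simplified illustration under assumed shapes, not the paper's exact procedure: `select_patches`, the max-over-tokens scoring rule, and `keep_ratio` are hypothetical names and choices.

```python
import numpy as np

def select_patches(patch_feats, text_feats, keep_ratio=0.5):
    """Score each image patch by its strongest dot-product match to any
    text token, then keep only the top fraction, shortening the visual
    sequence without adding any learnable parameters."""
    # attention-style scores between every patch and every text token
    scores = patch_feats @ text_feats.T           # (num_patches, num_tokens)
    relevance = scores.max(axis=1)                # best text match per patch
    k = max(1, int(keep_ratio * len(patch_feats)))
    keep = np.sort(np.argsort(relevance)[-k:])    # k most text-relevant patches
    return patch_feats[keep], keep
```

Because the scores reuse existing patch and token embeddings, the selection adds no parameters to the Vision Transformer, consistent with the bullet above.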
Chaoya Jiang
Shandong University
Multimodal Large Language Model
Wei Ye
National Engineering Research Center for Software Engineering, Peking University
Haiyang Xu
DAMO Academy, Alibaba Group
Shikun Zhang
Peking University
Jie Zhang
Fei Huang
DAMO Academy, Alibaba Group