Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models

📅 2025-05-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) suffer from weak visual-encoder discriminability and low cross-modal alignment accuracy on compositional reasoning tasks, primarily because existing training neglects image-domain negative samples, fails to distinguish negative samples by difficulty, and aligns positive samples insufficiently. To address these issues, we propose a text-to-vision hard negative cross-modal transfer mechanism that maps text-based hard negatives into semantically perturbed image negatives. We further design a multimodal hard negative contrastive loss coupled with a dynamic margin loss, enabling joint optimization in which contrastive margins adapt to sample difficulty. Our approach encompasses visual perturbation generation, hard negative mining, and alignment enhancement. Evaluated on three compositional reasoning benchmarks, our method achieves significant performance gains, strengthening the visual encoder's fine-grained discriminative capability and improving cross-modal semantic alignment precision.

πŸ“ Abstract
Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model's discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs' performance on complex CR tasks. The source code is available at https://github.com/nynu-BDAI/AHNPL.
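The dynamic margin loss described in the abstract can be sketched as a hinge-style contrastive objective whose margin grows with negative-sample difficulty. The sketch below is a minimal NumPy illustration under assumed choices (cosine similarity as the alignment score, hard-negative similarity as the difficulty signal, and hypothetical `base_margin`/`scale` hyperparameters); it is not the paper's exact formulation.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between rows of two embedding matrices.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def dynamic_margin_loss(img_emb, pos_txt_emb, neg_txt_emb,
                        base_margin=0.2, scale=0.3):
    """Hinge loss whose margin adapts to how hard each negative is.

    Difficulty is taken as the hard negative's similarity to the image:
    the closer the negative caption sits to the image embedding, the
    larger the margin the pair must satisfy.
    """
    s_pos = cosine_sim(img_emb, pos_txt_emb)   # image vs. true caption
    s_neg = cosine_sim(img_emb, neg_txt_emb)   # image vs. hard negative
    difficulty = np.clip(s_neg, 0.0, 1.0)      # harder negative -> bigger margin
    margin = base_margin + scale * difficulty
    return np.maximum(0.0, margin - (s_pos - s_neg)).mean()
```

With an easy negative (orthogonal to the image) the hinge is inactive and the loss is zero, while a near-duplicate negative both shrinks the similarity gap and raises the margin, so the pair contributes a larger penalty.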
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLMs' fine-grained semantic distinction in CR tasks
Addressing neglect of image-based negatives in VLM training
Improving alignment and discrimination of hard sample pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates image-based negatives from text hard negatives
Uses multimodal hard negative loss for discrimination
Adjusts contrastive margin dynamically by sample difficulty
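One plausible reading of the first innovation, generating image-side negatives from text hard negatives, is an embedding-space perturbation: shift the image embedding along the direction separating the hard-negative caption from the true caption. The sketch below illustrates that idea only; the function name, the `strength` parameter, and the construction itself are assumptions, and the paper's actual transfer mechanism may differ.

```python
import numpy as np

def perturbed_image_negative(img_emb, pos_txt_emb, neg_txt_emb, strength=0.5):
    # Direction pointing from the true caption toward the hard negative.
    direction = neg_txt_emb - pos_txt_emb
    direction = direction / (np.linalg.norm(direction, axis=-1,
                                            keepdims=True) + 1e-8)
    # Push the image embedding toward the hard-negative semantics,
    # then re-normalize to stay on the unit hypersphere.
    neg_img = img_emb + strength * direction
    return neg_img / np.linalg.norm(neg_img, axis=-1, keepdims=True)
```

The resulting vector behaves as a "semantically disturbed" image negative: it remains close to the original image embedding but is measurably more similar to the hard-negative caption, giving the visual encoder something to discriminate against.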
Xin Huang
School of Artificial Intelligence and Software Engineering, Nanyang Normal University, Henan, China; Collaborative Innovation Center of Intelligent Explosion-proof Equipment, Henan, China

Ruibin Li
University of Toronto
Persistent Memory, File System

Tong Jia
Peking University
AIOps, Anomaly Detection, Log Analysis, AI for Medical Research

Wei Zheng
School of Artificial Intelligence and Software Engineering, Nanyang Normal University, Henan, China; Collaborative Innovation Center of Intelligent Explosion-proof Equipment, Henan, China

Ya Wang
School of Artificial Intelligence and Software Engineering, Nanyang Normal University, Henan, China; Collaborative Innovation Center of Intelligent Explosion-proof Equipment, Henan, China