Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models

📅 2025-05-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) suffer from weak visual-encoder discriminability and low cross-modal alignment accuracy on compositional reasoning tasks, primarily because existing training neglects image-domain negative samples, fails to distinguish negative samples by difficulty, and aligns positive samples insufficiently. To address these issues, we propose a text-to-vision hard negative cross-modal transfer mechanism that maps text-based hard negatives into semantically perturbed image negatives. We further design a multimodal hard negative contrastive loss coupled with a dynamic margin loss, enabling joint optimization in which contrastive margins adapt to sample difficulty. Our approach encompasses visual perturbation generation, hard negative mining, and alignment enhancement. Evaluated on three compositional reasoning benchmarks, our method achieves significant performance gains, strengthening the visual encoder's fine-grained discriminative capability and improving cross-modal semantic alignment precision.

πŸ“ Abstract
Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model's discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs' performance on complex CR tasks. The source code is available at https://github.com/nynu-BDAI/AHNPL.
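The dynamic margin loss described in the abstract can be sketched as a hinge-style contrastive objective whose margin grows with negative-sample difficulty. The sketch below is a minimal NumPy illustration under assumed choices (cosine similarity as the alignment score, hard-negative similarity as the difficulty signal, and hypothetical `base_margin`/`scale` hyperparameters); it is not the paper's exact formulation.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between rows of two embedding matrices.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def dynamic_margin_loss(img_emb, pos_txt_emb, neg_txt_emb,
                        base_margin=0.2, scale=0.3):
    """Hinge loss whose margin adapts to how hard each negative is.

    Difficulty is taken as the hard negative's similarity to the image:
    the closer the negative caption sits to the image embedding, the
    larger the margin the pair must satisfy.
    """
    s_pos = cosine_sim(img_emb, pos_txt_emb)   # image vs. true caption
    s_neg = cosine_sim(img_emb, neg_txt_emb)   # image vs. hard negative
    difficulty = np.clip(s_neg, 0.0, 1.0)      # harder negative -> bigger margin
    margin = base_margin + scale * difficulty
    return np.maximum(0.0, margin - (s_pos - s_neg)).mean()
```

With an easy negative (orthogonal to the image) the hinge is inactive and the loss is zero, while a near-duplicate negative both shrinks the similarity gap and raises the margin, so the pair contributes a larger penalty.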
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLMs' fine-grained semantic distinction in CR tasks
Addressing neglect of image-based negatives in VLM training
Improving alignment and discrimination of hard sample pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates image-based negatives from text hard negatives
Uses multimodal hard negative loss for discrimination
Adjusts contrastive margin dynamically by sample difficulty
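One plausible reading of the first innovation, generating image-side negatives from text hard negatives, is an embedding-space perturbation: shift the image embedding along the direction separating the hard-negative caption from the true caption. The sketch below illustrates that idea only; the function name, the `strength` parameter, and the construction itself are assumptions, and the paper's actual transfer mechanism may differ.

```python
import numpy as np

def perturbed_image_negative(img_emb, pos_txt_emb, neg_txt_emb, strength=0.5):
    # Direction pointing from the true caption toward the hard negative.
    direction = neg_txt_emb - pos_txt_emb
    direction = direction / (np.linalg.norm(direction, axis=-1,
                                            keepdims=True) + 1e-8)
    # Push the image embedding toward the hard-negative semantics,
    # then re-normalize to stay on the unit hypersphere.
    neg_img = img_emb + strength * direction
    return neg_img / np.linalg.norm(neg_img, axis=-1, keepdims=True)
```

The resulting vector behaves as a "semantically disturbed" image negative: it remains close to the original image embedding but is measurably more similar to the hard-negative caption, giving the visual encoder something to discriminate against.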
Xin Huang
School of Artificial Intelligence and Software Engineering, Nanyang Normal University, Henan, China; Collaborative Innovation Center of Intelligent Explosion-proof Equipment, Henan, China

Ruibin Li
University of Toronto
Persistent Memory, File System

Tong Jia
Peking University
AIOps, Anomaly Detection, Log Analysis, AI for Medical Research

Wei Zheng
School of Artificial Intelligence and Software Engineering, Nanyang Normal University, Henan, China; Collaborative Innovation Center of Intelligent Explosion-proof Equipment, Henan, China

Ya Wang
School of Artificial Intelligence and Software Engineering, Nanyang Normal University, Henan, China; Collaborative Innovation Center of Intelligent Explosion-proof Equipment, Henan, China