🤖 AI Summary
Large Vision-Language Models (VLMs) suffer from hallucination and inaccurate localization because fine-grained visual content is not explicitly modeled during training, leading to over-reliance on linguistic priors. To address this, we propose the Symmetrical Visual Contrastive Optimization (S-VCO) objective, which strengthens token-level image-text alignment by contrasting minimally differing image-text pairs. We further introduce MVC, a counterfactual dataset of Minimal Visual Contrasts built through automated augmentation and rigorous filtering to pose fine-grained discriminative challenges. Our method integrates contrastive learning, cross-modal fine-grained alignment, and explicit supervision between visual and text tokens. Experiments demonstrate up to a 22% reduction in hallucination rate on highly vision-dependent tasks, consistent performance gains across multiple benchmarks, and preservation of general-purpose capabilities. The core contributions are the S-VCO objective and the MVC dataset, which together establish a paradigm for mitigating VLM hallucination through structured visual grounding and counterfactual reasoning.
📝 Abstract
Recent studies have shown that Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors, resulting in errors on visually grounded tasks and hallucinations. We hypothesize that this issue arises because existing VLMs are not explicitly trained to generate text that is accurately grounded in fine-grained image details. To enhance visual feedback during VLM training, we propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details and aligning them with the corresponding text tokens. To further facilitate this detailed alignment, we introduce MVC, a paired image-text dataset built by automatically filtering and augmenting visual counterfactual data to challenge the model with hard contrastive cases involving Minimal Visual Contrasts. Experiments show that our method consistently improves VLM performance across diverse benchmarks covering various abilities and domains, achieving up to a 22% reduction in hallucinations and significant gains on vision-centric and general tasks. Notably, these improvements grow more pronounced on benchmarks with higher visual dependency. In short, S-VCO substantially improves VLMs' performance on visually dependent tasks while retaining, or even improving, their general abilities. We open-source our code at https://s-vco.github.io/
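To make the contrastive idea concrete, here is a minimal sketch (not the paper's exact formulation) of a symmetric preference-style loss over a minimal-contrast pair: text t matched to image v, and counterfactual text t' matched to counterfactual image v'. The score arguments stand for model log-likelihoods of a text given an image; the function name, the `beta` scaling parameter, and the specific log-sigmoid form are illustrative assumptions in the spirit of the objective described above.

```python
import math

def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def symmetric_contrastive_loss(s_t_v: float, s_t_vp: float,
                               s_tp_vp: float, s_tp_v: float,
                               beta: float = 1.0) -> float:
    """Hypothetical symmetric visual-contrastive loss on one pair.

    s_t_v  : log-likelihood of text t under its matching image v
    s_t_vp : log-likelihood of t under the counterfactual image v'
    s_tp_vp: log-likelihood of counterfactual text t' under v'
    s_tp_v : log-likelihood of t' under the original image v

    Each term rewards preferring a text under its matching image
    over the minimally different counterfactual image, so the model
    must attend to the fine-grained visual difference to lower the loss.
    """
    attend = -log_sigmoid(beta * (s_t_v - s_t_vp))
    reject = -log_sigmoid(beta * (s_tp_vp - s_tp_v))
    return attend + reject
```

With zero margins the loss sits at 2·ln 2, and it shrinks as the model assigns each text a higher likelihood under its own image than under the counterfactual, which is the intended symmetric behavior.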