🤖 AI Summary
Large Vision-Language Models (VLMs) suffer from hallucination and inaccurate localization because fine-grained visual content is not explicitly modeled during training, leading to over-reliance on linguistic priors. To address this, we propose the Symmetrical Visual Contrastive Optimization (S-VCO) objective, which strengthens token-level image-text alignment by contrasting minimally differing image-text pairs. We further introduce MVC, a counterfactual dataset of Minimal Visual Contrasts built through automated augmentation and rigorous filtering to pose fine-grained discriminative challenges. Our method integrates contrastive learning, cross-modal fine-grained alignment, and explicit supervision between visual and text tokens. Experiments demonstrate up to a 22% reduction in hallucination rate on highly vision-dependent tasks, consistent performance gains across multiple benchmarks, and preservation of general-purpose capabilities. The core contributions are the S-VCO objective and the MVC dataset, which together establish a paradigm for mitigating VLM hallucination through structured visual grounding and counterfactual reasoning.
📝 Abstract
Recent studies have shown that Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors, resulting in errors on visually grounded tasks and hallucinations. We hypothesize that this issue arises because existing VLMs are not explicitly trained to generate text that is accurately grounded in fine-grained image details. To enhance visual feedback during VLM training, we propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details and aligning them with the corresponding text tokens. To further facilitate this detailed alignment, we introduce MVC, a paired image-text dataset built by automatically filtering and augmenting visual counterfactual data to challenge the model with hard contrastive cases involving Minimal Visual Contrasts. Experiments show that our method consistently improves VLM performance across diverse benchmarks covering various abilities and domains, achieving up to a 22% reduction in hallucinations and significant gains on vision-centric and general tasks. Notably, these improvements grow more pronounced on benchmarks with higher visual dependency. In short, S-VCO substantially improves VLMs' performance on visually dependent tasks while retaining, or even improving, their general abilities. We open-source our code at https://s-vco.github.io/
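To make the contrastive idea concrete, here is a minimal sketch (not the paper's exact formulation) of a symmetric preference-style loss over a minimal-contrast pair: text t matched to image v, and counterfactual text t' matched to counterfactual image v'. The score arguments stand for model log-likelihoods of a text given an image; the function name, the `beta` scaling parameter, and the specific log-sigmoid form are illustrative assumptions in the spirit of the objective described above.

```python
import math

def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def symmetric_contrastive_loss(s_t_v: float, s_t_vp: float,
                               s_tp_vp: float, s_tp_v: float,
                               beta: float = 1.0) -> float:
    """Hypothetical symmetric visual-contrastive loss on one pair.

    s_t_v  : log-likelihood of text t under its matching image v
    s_t_vp : log-likelihood of t under the counterfactual image v'
    s_tp_vp: log-likelihood of counterfactual text t' under v'
    s_tp_v : log-likelihood of t' under the original image v

    Each term rewards preferring a text under its matching image
    over the minimally different counterfactual image, so the model
    must attend to the fine-grained visual difference to lower the loss.
    """
    attend = -log_sigmoid(beta * (s_t_v - s_t_vp))
    reject = -log_sigmoid(beta * (s_tp_vp - s_tp_v))
    return attend + reject
```

With zero margins the loss sits at 2·ln 2, and it shrinks as the model assigns each text a higher likelihood under its own image than under the counterfactual, which is the intended symmetric behavior.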