Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

📅 2026-03-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of visual hallucinations in vision-language models (VLMs) during self-improvement, which hinder their reasoning capabilities. To mitigate this issue, the authors propose VC-STaR, a novel framework that introduces a visual contrast mechanism into the VLM self-refinement pipeline. By constructing contrastive visual question answering (VQA) pairs, in which questions are semantically similar but grounded in images with subtle visual differences, the model is guided to generate more accurate reasoning chains. Leveraging this approach, the authors automatically curate VisCoR-55K, a high-quality supervised fine-tuning dataset derived entirely from self-generated data without external annotations. Experiments demonstrate that VC-STaR significantly outperforms existing self-improvement methods across multiple VLMs and even surpasses models fine-tuned on state-of-the-art visual reasoning datasets.

📝 Abstract
Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: https://github.com/zhiyupan42/VC-STaR.
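The abstract describes curating contrastive pairs "according to multi-modal similarity": near-synonymous questions grounded in subtly different images. The paper's actual selection criteria are not given on this page; the sketch below is a minimal illustration of one plausible reading, assuming precomputed image and question embeddings and purely hypothetical similarity thresholds.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between every row of a and every row of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def curate_contrastive_pairs(img_emb, q_emb,
                             q_thresh=0.9, img_low=0.7, img_high=0.95):
    """Return index pairs (i, j) whose questions are near-synonymous
    (similarity >= q_thresh) while their images are similar but not
    identical (img_low <= similarity <= img_high).

    Thresholds here are illustrative, not the paper's values."""
    q_sim = cosine_sim(q_emb, q_emb)
    i_sim = cosine_sim(img_emb, img_emb)
    pairs = []
    n = len(q_emb)
    for i in range(n):
        for j in range(i + 1, n):
            if q_sim[i, j] >= q_thresh and img_low <= i_sim[i, j] <= img_high:
                pairs.append((i, j))
    return pairs
```

Each selected pair could then be fed to the VLM side by side so that, per the paper's observation, contrasting the two images sharpens which visual cues the generated rationale attends to.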
Problem

Research questions and friction points this paper is trying to address.

visual hallucinations
vision language models
self-improving
visual reasoning
contrastive VQA
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual contrast
self-improving
visual reasoning
vision-language models
hallucination mitigation