Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) often produce factual inaccuracies by hallucinating details in their generated outputs. To address this, we propose a self-reflective training paradigm that requires no human annotation or external supervision. Our method automatically constructs high-quality preference data by detecting internal inconsistencies between short and long responses (both generated by the same VLM for the same input) and converts these inconsistencies into unsupervised training signals via contrastive learning between binary correctness judgments and detailed explanations. This is the first work to systematically exploit intra-model response consistency (i.e., alignment between short and long outputs) as a source of self-supervised signals. Evaluated on multiple hallucination benchmarks, our approach significantly improves factual accuracy while preserving strong instruction-following capability on LLaVA-Bench and MMBench. The method is computationally efficient, scalable, and practically deployable.

📝 Abstract
Vision-language models often hallucinate details, generating non-existent objects or inaccurate attributes that compromise output reliability. Existing methods typically address these issues via extensive human annotations or external supervision from more powerful models. In this work, we present a novel framework that leverages the model's self-consistency between long responses and short answers to generate preference pairs for training. We observe that short binary questions tend to yield highly reliable responses, which can be used to query the target model to evaluate and rank its generated responses. Specifically, we design a self-reflection pipeline where detailed model responses are compared against concise binary answers, and inconsistency signals are utilized to automatically curate high-quality training data without human annotations or external model-based supervision. By relying solely on self-consistency rather than external supervision, our method offers a scalable and efficient solution that effectively reduces hallucinations using unlabeled data. Extensive experiments on multiple benchmarks, i.e., AMBER, MultiObject-Hal (ROPE), Object HalBench, and MMHal-Bench, demonstrate significant improvements in factual grounding and reliability. Moreover, our approach maintains robust instruction-following ability, as evidenced by enhanced performance on LLaVA-Bench and MMBench.
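The pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`extract_claims`, `ask_binary`), the yes-rate scoring rule, and the claim-extraction step are all hypothetical stand-ins for the paper's unspecified prompts and heuristics.

```python
# Hypothetical sketch of self-consistency preference-pair construction.
# `extract_claims(response)` turns a long response into binary questions
# (e.g. "A dog on a bench" -> "Is there a dog?"); `ask_binary(image, q)`
# queries the same VLM for a yes/no answer. Both are assumptions here.

def build_preference_pairs(image, long_responses, extract_claims, ask_binary):
    """Score each long response by how often the model's own binary
    answers confirm its claims; pair the most- and least-consistent."""
    scored = []
    for resp in long_responses:
        claims = extract_claims(resp)
        if not claims:
            continue  # nothing checkable in this response
        agree = sum(1 for q in claims if ask_binary(image, q) == "yes")
        scored.append((agree / len(claims), resp))
    scored.sort(key=lambda t: t[0], reverse=True)
    if len(scored) < 2 or scored[0][0] == scored[-1][0]:
        return None  # no inconsistency signal for this image
    return {"chosen": scored[0][1], "rejected": scored[-1][1]}
```

The resulting `chosen`/`rejected` pairs could then feed a standard preference-optimization objective; the exact training loss used by the paper is not detailed in this summary.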
Problem

Research questions and friction points this paper is trying to address.

VLMs hallucinate non-existent objects and inaccurate attributes, undermining output reliability
Existing mitigation methods depend on costly human annotation or supervision from stronger external models
How to extract reliable, label-free training signals from the model's own responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-consistency between long and short self-generated responses yields preference training pairs
Self-reflection pipeline checks detailed responses against concise binary answers for agreement
Fully automated data curation, with no human annotation or external model supervision