🤖 AI Summary
Vision-language models (VLMs) often produce factual inaccuracies by hallucinating details in their generated outputs. To address this, we propose a self-reflective training paradigm that requires no human annotation or external supervision. Our method automatically constructs high-quality preference data by detecting internal inconsistencies between short and long responses, both generated by the same VLM for the same input, and converts these inconsistencies into unsupervised training signals by contrasting binary correctness judgments with detailed explanations. To our knowledge, this is the first work to systematically exploit intra-model response consistency (i.e., agreement between short and long outputs) as a source of self-supervision. Across multiple hallucination benchmarks, our approach substantially improves factual accuracy while preserving strong instruction-following capability on LLaVA-Bench and MMBench. The method is computationally efficient, scalable, and practically deployable.
📝 Abstract
Vision-language models often hallucinate details, generating non-existent objects or inaccurate attributes that compromise output reliability. Existing methods typically address these issues via extensive human annotation or external supervision from more powerful models. In this work, we present a novel framework that leverages the model's self-consistency between long responses and short answers to generate preference pairs for training. We observe that short binary questions tend to elicit highly reliable answers, which can be used to evaluate and rank the model's own long-form responses. Specifically, we design a self-reflection pipeline in which detailed model responses are compared against concise binary answers, and the resulting inconsistency signals are used to automatically curate high-quality training data without human annotation or external model-based supervision. By relying solely on self-consistency rather than external supervision, our method offers a scalable and efficient solution that effectively reduces hallucinations using unlabeled data. Extensive experiments on multiple benchmarks, namely AMBER, MultiObject-Hal (ROPE), Object HalBench, and MMHal-Bench, demonstrate significant improvements in factual grounding and reliability. Moreover, our approach maintains robust instruction-following ability, as evidenced by enhanced performance on LLaVA-Bench and MMBench.
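The pipeline described above can be sketched roughly as follows. This is a minimal illustrative mock-up, not the paper's actual implementation: `vlm` stands in for any callable that maps an (image, prompt) pair to text, and the claim extractor, binary-question template, and scoring rule are all simplifying assumptions introduced here for clarity.

```python
# Hypothetical sketch of the self-consistency preference-pair pipeline.
# Assumption: `vlm(image, prompt) -> str` is the target model; claim
# extraction and the yes/no question template are illustrative only.

def extract_object_claims(long_response):
    """Toy claim extractor: treat each known noun in the response as a claim."""
    vocabulary = {"dog", "frisbee", "cat", "car"}  # placeholder open-vocabulary parser
    words = long_response.lower().replace(".", "").split()
    return [w for w in words if w in vocabulary]

def is_consistent(vlm, image, claim):
    """Re-query the same model with a short binary question about one claim."""
    answer = vlm(image, f"Is there a {claim} in the image? Answer yes or no.")
    return answer.strip().lower().startswith("yes")

def build_preference_pair(vlm, image, prompt):
    """Sample two long responses, score each by agreement with the model's
    own binary answers, and keep the more self-consistent one as 'chosen'."""
    responses = [vlm(image, prompt) for _ in range(2)]

    def score(resp):
        claims = extract_object_claims(resp)
        if not claims:
            return 0.0
        return sum(is_consistent(vlm, image, c) for c in claims) / len(claims)

    ranked = sorted(responses, key=score, reverse=True)
    return {"chosen": ranked[0], "rejected": ranked[1]}

# Stub model for demonstration: binary probes confirm only the dog, and
# the two sampled descriptions alternate between faithful and hallucinated.
def stub_vlm(image, prompt):
    if prompt.startswith("Is there a"):
        return "yes" if "dog" in prompt else "no"
    stub_vlm.calls = getattr(stub_vlm, "calls", 0) + 1
    return "A dog." if stub_vlm.calls % 2 else "A dog and a frisbee."

pair = build_preference_pair(stub_vlm, None, "Describe the image.")
print(pair["chosen"])    # the fully consistent description
print(pair["rejected"])  # the description containing the unconfirmed object
```

The resulting `{"chosen", "rejected"}` pairs are exactly the format expected by standard preference-optimization trainers (e.g., DPO-style objectives), which is presumably how such curated data would be consumed downstream.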