🤖 AI Summary
Large Vision-Language Models (LVLMs) commonly suffer from hallucinations—textual outputs inconsistent with the input image. Existing contrastive decoding methods, which rely on global visual uncertainty estimation, fail to precisely localize and suppress hallucinated tokens and may even introduce new hallucinations. To address this, we propose Hallucination-Induced Optimization (HIO), a theory-driven framework featuring: (1) a novel fine-grained hallucination token identification mechanism; and (2) Contrary Bradley–Terry preference modeling coupled with multi-stage logits reweighting to enable targeted contrastive reinforcement between hallucinated and grounded tokens. Unlike prior global uncertainty approaches, HIO operates at the token level, enabling precise hallucination mitigation. Extensive experiments demonstrate that HIO significantly reduces hallucination rates across multiple benchmarks while improving both output faithfulness and cross-modal alignment accuracy, consistently outperforming state-of-the-art methods.
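To make the "contrastive logits gap" idea concrete, here is a minimal sketch of one step of visual contrastive decoding. The function name, the specific arrays, and the single-knob `alpha` weighting are illustrative assumptions, not the paper's exact formulation (HIO additionally applies fine-grained token identification and multi-stage reweighting on top of this basic contrast):

```python
import numpy as np

def contrastive_decode_step(logits_grounded, logits_hallucinatory, alpha=1.0):
    """One step of (hypothetical) contrastive decoding.

    logits_grounded:      next-token logits conditioned on the full input image.
    logits_hallucinatory: logits from a hallucination-prone pass (e.g. a
                          distorted or masked image).

    Subtracting the hallucinatory logits widens the gap between grounded
    and hallucinated tokens before the next token is chosen.
    """
    contrast = (1 + alpha) * logits_grounded - alpha * logits_hallucinatory
    # Greedy pick: the token whose grounded evidence most exceeds its
    # hallucinatory evidence.
    return int(np.argmax(contrast))
```

The summary's criticism of prior work maps onto this sketch directly: if `logits_hallucinatory` comes from a *global* uncertainty perturbation, the subtraction can penalize grounded tokens and boost spurious ones, which is the failure mode HIO's token-level induction is designed to avoid.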
📝 Abstract
Although Large Visual Language Models (LVLMs) have demonstrated exceptional abilities in understanding multimodal data, they invariably suffer from hallucinations, leading to a disconnect between the generated text and the corresponding images. Almost all current visual contrastive decoding methods attempt to mitigate these hallucinations by introducing visual uncertainty information that appropriately widens the contrastive logits gap between hallucinatory and targeted tokens. However, due to the uncontrollable nature of global visual uncertainty, these methods struggle to precisely induce the hallucinatory tokens, which severely limits their effectiveness in mitigating hallucinations and may even lead to the generation of undesired new hallucinations. To tackle this issue, we conducted a theoretical analysis of the conditions under which contrastive decoding is effective. Building on this insight, we introduce a novel optimization strategy named Hallucination-Induced Optimization (HIO). This strategy amplifies the contrast between hallucinatory and targeted tokens by relying on a fine-tuned theoretical preference model (i.e., a Contrary Bradley-Terry Model), thereby facilitating efficient contrastive decoding to alleviate hallucinations in LVLMs. Extensive experiments demonstrate that our HIO strategy effectively reduces hallucinations in LVLMs, outperforming state-of-the-art methods across various benchmarks.
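The Contrary Bradley-Terry Model can be understood against the standard Bradley-Terry preference model, where the probability that one response beats another is a sigmoid of their score difference. The sketch below shows that baseline and one plausible reading of the "contrary" variant, in which the preference is deliberately reversed so the auxiliary model learns to favor hallucinatory tokens, amplifying them so the contrastive gap at decoding time is wider. The function names and the exact reversal are illustrative assumptions, not the paper's verbatim objective:

```python
import math

def bradley_terry_nll(score_preferred, score_rejected):
    """Negative log-likelihood under the standard Bradley-Terry model:
    P(preferred beats rejected) = sigmoid(score_preferred - score_rejected)."""
    p = 1.0 / (1.0 + math.exp(-(score_preferred - score_rejected)))
    return -math.log(p)

def contrary_bradley_terry_nll(score_target, score_halluc):
    """Sketch of a 'contrary' objective (hypothetical reading): swap the
    roles so the model is trained to PREFER the hallucinatory token's score,
    inducing hallucinations in the contrast branch on purpose."""
    return bradley_terry_nll(score_halluc, score_target)
```

Minimizing the contrary loss pushes hallucinatory scores up relative to targeted ones in the induced model, which is exactly what a subtractive contrastive decoder needs on its "negative" branch.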