Stop learning it all to mitigate visual hallucination, Focus on the hallucination target

📅 2025-06-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multimodal large language models (MLLMs) frequently exhibit visual hallucinations—generating objects not present in the input image—thereby severely compromising factual consistency and reliability in vision-language tasks. To address this, we propose a hallucination-targeted fine-grained preference learning framework. Our method is the first to localize preference optimization to specific hallucinated response segments and their corresponding image regions, enabling pixel-level supervision. We construct a novel dataset containing paired hallucinated/correct responses with precise pixel-level grounding annotations. By integrating multimodal alignment modeling with customized response chunking and region-aware labeling, our approach achieves interpretable and spatially grounded hallucination suppression. Extensive experiments demonstrate that our method significantly reduces hallucination rates across multiple visual hallucination benchmarks, substantially improving model factuality and reliability without degrading overall task performance.

📝 Abstract

Multimodal Large Language Models (MLLMs) frequently hallucinate, generating information about objects that are not present in the input image during vision-language tasks. These hallucinations particularly undermine model reliability in practical applications that require accurate object identification. To address this challenge, we propose a preference learning approach that mitigates hallucinations by focusing on the targeted areas where they occur. To implement this, we build a dataset containing hallucinated responses, correct responses, and target information (i.e., the objects present in the images and the positions of the response chunks affected by hallucinations). By restricting preference learning to these specific targets, the model can filter out irrelevant signals and focus on correcting hallucinations, which lets it produce more factual responses. Experimental results demonstrate that our method reduces hallucinations across multiple vision hallucination tasks, improving the reliability of MLLMs without diminishing overall performance.
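The idea of restricting preference learning to hallucination targets can be illustrated with a small sketch. The abstract does not state the exact objective, so the code below assumes a DPO-style loss in which only tokens inside the target chunks contribute to the log-probability sums. The function names, the mask representation, and the `beta` parameter are illustrative assumptions, not the paper's implementation.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def masked_logprob(token_logps: Sequence[float], target_mask: Sequence[int]) -> float:
    """Sum per-token log-probabilities only at positions where the mask is 1."""
    if len(token_logps) != len(target_mask):
        raise ValueError("token_logps and target_mask must have the same length")
    return sum(lp for lp, m in zip(token_logps, target_mask) if m)


def targeted_preference_loss(
    policy_chosen: Sequence[float],
    policy_rejected: Sequence[float],
    ref_chosen: Sequence[float],
    ref_rejected: Sequence[float],
    chosen_mask: Sequence[int],
    rejected_mask: Sequence[int],
    beta: float = 0.1,
) -> float:
    """DPO-style loss computed over target tokens only (assumed formulation).

    chosen = correct response, rejected = hallucinated response.
    Tokens outside the masks are ignored, so they send no training signal.
    """
    chosen_ratio = masked_logprob(policy_chosen, chosen_mask) - masked_logprob(
        ref_chosen, chosen_mask
    )
    rejected_ratio = masked_logprob(policy_rejected, rejected_mask) - masked_logprob(
        ref_rejected, rejected_mask
    )
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) is equal to softplus(-margin).
    return softplus(-margin)


# Toy example: three tokens per response, only the middle token is a target.
mask = [0, 1, 0]
loss = targeted_preference_loss(
    policy_chosen=[-1.0, -0.5, -2.0],
    policy_rejected=[-1.0, -3.0, -2.0],
    ref_chosen=[-1.0, -1.0, -2.0],
    ref_rejected=[-1.0, -1.0, -2.0],
    chosen_mask=mask,
    rejected_mask=mask,
    beta=1.0,
)
```

Because the masks zero out every non-target position, changing the log-probability of an unaffected token leaves the loss unchanged, which is one way to read "filtering out irrelevant signals".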

Problem

Research questions and friction points this paper is trying to address.

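MLLMs generate information about objects that are not present in the input image during vision-language tasks

These hallucinations undermine reliability in applications that require accurate object identification

Preference learning over whole responses carries irrelevant signals, so the paper restricts it to the hallucinated targets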

Innovation

Methods, ideas, or system contributions that make the work stand out.

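Preference learning is restricted to the response chunks where hallucinations occur

Dataset pairs hallucinated and correct responses with target information (objects present in the image and the positions of affected response chunks)

Filtering out irrelevant signals lets the model concentrate on correcting hallucinations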
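A record in a dataset of this kind could be represented as follows. This is a hypothetical layout: the field names, the token-level half-open span convention, and the example image path are assumptions made for illustration, since the page does not describe the actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PreferenceRecord:
    """One training example (hypothetical schema)."""

    image_path: str
    correct_response: List[str]        # tokens of the preferred response
    hallucinated_response: List[str]   # tokens of the rejected response
    present_objects: List[str]         # objects actually in the image
    # Half-open [start, end) token spans in hallucinated_response
    # that are affected by hallucination.
    hallucinated_spans: List[Tuple[int, int]]


def spans_to_mask(length: int, spans: List[Tuple[int, int]]) -> List[int]:
    """Turn chunk positions into a 0/1 token mask of the given length."""
    mask = [0] * length
    for start, end in spans:
        if not (0 <= start <= end <= length):
            raise ValueError(f"span ({start}, {end}) is out of range for length {length}")
        for i in range(start, end):
            mask[i] = 1
    return mask


record = PreferenceRecord(
    image_path="example.jpg",  # placeholder path
    correct_response="A dog sits on the grass".split(),
    hallucinated_response="A dog sits next to a frisbee".split(),
    present_objects=["dog", "grass"],
    hallucinated_spans=[(5, 7)],  # covers the tokens "a frisbee"
)
target_mask = spans_to_mask(len(record.hallucinated_response), record.hallucinated_spans)
```

A mask like `target_mask` is what a targeted loss would consume to ignore every token outside the hallucinated chunk.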
