🤖 AI Summary
Multimodal large language models (MLLMs) commonly suffer from object hallucination in image understanding, generating text that is inconsistent with the visual content. To address this, we propose Local Perception Search (LPS), a training-free, plug-and-play decoding-time method that, for the first time, formulates attention-based local visual priors as a decoding value function, enabling model-agnostic hallucination suppression without any retraining. LPS corrects generation dynamically by performing online search in the decoding space and applying prior-weighted sequence re-ranking, and it remains robust and generalizes well under high-noise image conditions. Experiments on mainstream hallucination benchmarks and noisy datasets show that LPS significantly reduces hallucination rates, consistently outperforming baseline methods, with improvements of up to 32.7% under noisy conditions.
📝 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled them to integrate vision and language effectively across a variety of downstream tasks. Despite this success, these models still exhibit hallucination: outputs that appear plausible but do not align with the content of the images. To mitigate this issue, we introduce Local Perception Search (LPS), a simple, training-free decoding method applied at inference time that effectively suppresses hallucinations. LPS leverages local visual prior information as a value function to correct the decoding process. We further observe that the impact of the local visual prior on model performance is more pronounced when image noise is high. Notably, LPS is a plug-and-play approach compatible with a wide range of models. Extensive experiments on widely used hallucination benchmarks and noisy data demonstrate that LPS significantly reduces the incidence of hallucinations relative to baselines, performing especially well in noisy settings.
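The prior-weighted sequence re-ranking idea described in the abstract can be sketched in a few lines. This is a minimal illustrative sketch only, not the paper's implementation: the candidate representation, the `prior` score (standing in for the attention-based local visual prior), and the mixing weight `alpha` are all assumptions introduced here for clarity.

```python
def rerank(candidates, alpha=0.5):
    """Re-rank candidate decodings by combining the language-model
    log-probability with a local visual prior score (hypothetical
    interpolation; the actual value function in LPS may differ)."""
    def combined(c):
        # Higher lm_logprob (closer to 0) and higher prior are both better.
        return (1 - alpha) * c["lm_logprob"] + alpha * c["prior"]
    return sorted(candidates, key=combined, reverse=True)

# Toy candidates: the fluent-but-hallucinated caption has a slightly
# better LM score, but a much weaker visual-prior score.
candidates = [
    {"text": "a cat on a sofa", "lm_logprob": -4.0, "prior": 0.9},
    {"text": "a dog on a sofa", "lm_logprob": -3.5, "prior": 0.2},
]
best = rerank(candidates)[0]["text"]  # the visually grounded caption wins
```

In this toy example the prior term overrides the marginally higher LM score of the hallucinated caption, which is the intended effect of using the visual prior as a decoding value function.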