Mitigating Object Hallucination via Robust Local Perception Search

📅 2025-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) commonly suffer from object hallucination in image understanding: they generate text inconsistent with the visual content. To address this, we propose Local Perception Search (LPS), a training-free, plug-and-play decoding-time method that, for the first time, formulates attention-based local visual priors as a decoding value function, enabling model-agnostic hallucination suppression without any retraining. LPS dynamically corrects generation by performing online search in the decoding space and applying prior-weighted sequence re-ranking, and it remains robust under high image noise. Experiments on mainstream hallucination benchmarks and noisy datasets show that LPS significantly reduces hallucination rates, consistently outperforming baseline methods; under noisy conditions the improvement reaches up to 32.7%.
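The core decoding idea, scoring candidate continuations by a combination of the language-model likelihood and a visual-prior value function, then re-ranking, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the linear score combination, and the weight `alpha` are all assumptions.

```python
def lps_rerank(candidates, lm_scores, prior_scores, alpha=0.5):
    """Re-rank candidate continuations by combining the language-model
    log-score with a local visual prior acting as a value function.

    candidates   -- list of candidate text continuations
    lm_scores    -- log-probabilities assigned by the language model
    prior_scores -- value-function scores from the local visual prior
                    (higher = better grounded in the image)
    alpha        -- weight of the visual prior (hypothetical parameter)
    """
    combined = [lm + alpha * prior
                for lm, prior in zip(lm_scores, prior_scores)]
    # Sort candidates by the combined score, best first.
    ranked = sorted(zip(candidates, combined),
                    key=lambda pair: pair[1], reverse=True)
    return [cand for cand, _ in ranked]

# Toy example: the visually grounded candidate wins despite a slightly
# lower language-model score for the alternative.
best = lps_rerank(["a dog on the grass", "a cat on the grass"],
                  lm_scores=[-1.2, -1.5],
                  prior_scores=[0.9, 0.1])[0]
```

In an actual MLLM, the prior score for each candidate would be derived from the model's cross-attention over image regions rather than supplied by hand; the re-ranking step itself is the same.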

📝 Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have enabled them to effectively integrate vision and language, addressing a variety of downstream tasks. However, despite their significant success, these models still exhibit hallucination phenomena, where the outputs appear plausible but do not align with the content of the images. To mitigate this issue, we introduce Local Perception Search (LPS), a simple, training-free inference-time decoding method that effectively suppresses hallucinations. This method leverages local visual prior information as a value function to correct the decoding process. Additionally, we observe that the impact of the local visual prior on model performance is more pronounced in scenarios with high levels of image noise. Notably, LPS is a plug-and-play approach that is compatible with various models. Extensive experiments on widely used hallucination benchmarks and noisy data demonstrate that LPS significantly reduces the incidence of hallucinations compared to the baselines, with particularly strong performance in noisy settings.
Problem

Research questions and friction points this paper is trying to address.

Reducing object hallucination in multimodal language models
Improving decoding with local visual prior information
Enhancing performance in noisy image scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Local Perception Search (LPS) for decoding
Leverages local visual prior information
Plug-and-play approach for various models
Zixian Gao
Shanghai Artificial Intelligence Laboratory; Center for Future Media & School of Computer Science and Engineering, University of Electronic Science and Technology of China
Chao Yang
Shanghai Artificial Intelligence Laboratory
Zhanhui Zhou
UC Berkeley
Xing Xu
Center for Future Media & School of Computer Science and Engineering, University of Electronic Science and Technology of China
Chaochao Lu
Shanghai AI Laboratory, Causal AI