🤖 AI Summary
Multimodal large language models (MLLMs) commonly suffer from object hallucination in image understanding, generating text that is inconsistent with the visual content. To address this, we propose Local Perception Search (LPS), a training-free, plug-and-play decoding-time method that, for the first time, formulates attention-based local visual priors as a decoding value function, enabling model-agnostic hallucination suppression without any retraining. LPS corrects generation dynamically by performing online search in the decoding space and applying prior-weighted sequence re-ranking, and it remains robust and generalizes well under high-noise image conditions. Experiments on mainstream hallucination benchmarks and noisy datasets show that LPS significantly reduces hallucination rates, consistently outperforming baseline methods, with improvements of up to 32.7% under noisy conditions.
📝 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled them to integrate vision and language effectively across a variety of downstream tasks. Despite this success, these models still exhibit hallucination: outputs that appear plausible but do not align with the content of the images. To mitigate this issue, we introduce Local Perception Search (LPS), a simple, training-free decoding method applied at inference time that effectively suppresses hallucinations. LPS leverages local visual prior information as a value function to correct the decoding process. We further observe that the impact of the local visual prior on model performance is more pronounced when image noise is high. Notably, LPS is a plug-and-play approach compatible with a wide range of models. Extensive experiments on widely used hallucination benchmarks and noisy data demonstrate that LPS significantly reduces the incidence of hallucinations relative to baselines, performing especially well in noisy settings.
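The prior-weighted sequence re-ranking idea described in the abstract can be sketched in a few lines. This is a minimal illustrative sketch only, not the paper's implementation: the candidate representation, the `prior` score (standing in for the attention-based local visual prior), and the mixing weight `alpha` are all assumptions introduced here for clarity.

```python
def rerank(candidates, alpha=0.5):
    """Re-rank candidate decodings by combining the language-model
    log-probability with a local visual prior score (hypothetical
    interpolation; the actual value function in LPS may differ)."""
    def combined(c):
        # Higher lm_logprob (closer to 0) and higher prior are both better.
        return (1 - alpha) * c["lm_logprob"] + alpha * c["prior"]
    return sorted(candidates, key=combined, reverse=True)

# Toy candidates: the fluent-but-hallucinated caption has a slightly
# better LM score, but a much weaker visual-prior score.
candidates = [
    {"text": "a cat on a sofa", "lm_logprob": -4.0, "prior": 0.9},
    {"text": "a dog on a sofa", "lm_logprob": -3.5, "prior": 0.2},
]
best = rerank(candidates)[0]["text"]  # the visually grounded caption wins
```

In this toy example the prior term overrides the marginally higher LM score of the hallucinated caption, which is the intended effect of using the visual prior as a decoding value function.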