🤖 AI Summary
Large vision-language models (LVLMs) frequently exhibit object hallucination in visual question answering (VQA), particularly manifesting as an unwarranted bias toward affirmative (“yes”) answers.
Method: This paper proposes a training-free, external-model-free energy-guided decoding method. It first identifies and quantifies the pervasive “yes” answer bias in VQA datasets; then, during inference, it dynamically computes energy scores over hidden states across layers and selects the lowest-energy state for decoding—thereby mitigating the model’s over-reliance on affirmative responses. The method is architecture-agnostic, requires no parameter updates or input perturbations, and integrates seamlessly into standard LVLM inference pipelines.
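The exact energy function is not given in this summary; a common choice in the energy-based literature is the free-energy score E(h) = -T · logsumexp(logits / T), where the logits come from projecting a layer's hidden state through the shared LM head. A minimal NumPy sketch of layer selection under that assumption (function names and shapes are illustrative, not from the paper):

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log-sum-exp."""
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def select_min_energy_layer(hidden_states, lm_head, T=1.0):
    """Pick the layer whose hidden state has the lowest energy score.

    hidden_states: list of per-layer hidden vectors, each shape (d,)
    lm_head: shared output projection, shape (vocab_size, d)
    Energy is the free-energy score E(h) = -T * logsumexp(W h / T);
    lower energy roughly corresponds to a more confident prediction.
    """
    energies = [-T * logsumexp(lm_head @ h / T) for h in hidden_states]
    best = int(np.argmin(energies))
    return best, hidden_states[best]

# Illustrative usage with random weights standing in for a real LVLM.
rng = np.random.default_rng(0)
lm_head = rng.standard_normal((50, 8))
layers = [rng.standard_normal(8) for _ in range(4)]
idx, chosen = select_min_energy_layer(layers, lm_head)
```

The selected hidden state would then replace the final-layer state before decoding the answer token, which is what makes the method training-free and architecture-agnostic.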
Contribution/Results: Evaluated on the POPE, MME, and MMVP benchmarks, the approach achieves an average accuracy gain of 4.82%, reduces the “yes”-answer rate gap by 8.81%, and consistently outperforms multiple baselines in F1 score. It improves both the fairness and reliability of VQA inference without increasing computational overhead or sacrificing generality.
📝 Abstract
Mitigating object hallucination in large vision-language models (LVLMs) is critical to their safe deployment. Existing methods are either restricted to specific decoding strategies, demand sophisticated modifications to visual inputs, or rely on knowledge from external models. In this work, we first reveal that LVLMs exhibit a significant imbalance in the “Yes” ratio (i.e., the fraction of “Yes” answers among the total number of questions) across three different visual question answering (VQA) datasets. We then propose an energy-based decoding method that dynamically selects the hidden states from the layer with the minimal energy score. It is simple yet effective, reducing the yes-ratio bias while boosting performance across three benchmarks (POPE, MME, and MMVP). Our method consistently improves accuracy and F1 score over several baseline methods on three VQA datasets across three commonly used LVLMs, with an average accuracy improvement of 4.82% over greedy decoding. Moreover, the average yes-ratio gap is reduced by 8.81%, indicating that the proposed method is less biased, as shown in Figure 1.