Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

📅 2026-04-27

🤖 AI Summary
This work addresses object hallucination in vision-language models, which often stems from over-reliance on linguistic priors. To improve visual faithfulness without additional training, the authors propose Positive-and-Negative Decoding (PND), a framework that intervenes in the decoding process at inference time. The positive path amplifies salient visual evidence using multi-layer attention, while the negative path degrades the core object's features to build a counterfactual that penalizes prior-dominant generation. The method is motivated by the finding that these models under-weight visual features, an attention deficit that the positive path directly counteracts. On benchmarks including POPE, MME, and CHAIR, PND achieves state-of-the-art performance, improving accuracy by up to 6.5%, substantially reducing object hallucination, and enhancing descriptive detail. It generalizes across architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.

📝 Abstract
Vision-Language Models (VLMs) are frequently undermined by object hallucination--generating content that contradicts visual reality--due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: The positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object's features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model's outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail--all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.
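
The step-wise contrast described in the abstract can be illustrated with a small sketch. The abstract does not give PND's actual scoring formula, so the code below assumes the generic contrastive-decoding form `(1 + alpha) * positive - alpha * negative`; the function names, the `alpha` parameter, and the toy logits are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrast_logits(pos_logits, neg_logits, alpha=1.0):
    """Combine the two paths for one decoding step.

    ASSUMPTION: this is the generic contrastive-decoding rule, not a
    formula taken from the paper. Tokens whose score stays high even
    when the object is degraded (negative path) are pushed down.
    """
    if len(pos_logits) != len(neg_logits):
        raise ValueError("both paths must score the same vocabulary")
    return [(1.0 + alpha) * p - alpha * n
            for p, n in zip(pos_logits, neg_logits)]


def pick_token(pos_logits, neg_logits, alpha=1.0):
    """Greedy choice of the next token index after contrasting the paths."""
    probs = softmax(contrast_logits(pos_logits, neg_logits, alpha))
    return max(range(len(probs)), key=probs.__getitem__)


# Toy vocabulary and made-up logits, for illustration only.
vocab = ["dog", "frisbee", "cat"]
pos = [2.0, 1.8, 0.1]  # positive path: visual evidence amplified
neg = [2.0, 0.2, 0.1]  # negative path: core object degraded

# "dog" scores equally in both paths, so it is driven by the prior;
# "frisbee" drops once the object is degraded, so it is visually grounded.
print(vocab[pick_token(pos, neg, alpha=0.0)])  # no contrast
print(vocab[pick_token(pos, neg, alpha=1.0)])  # with contrast
```

With `alpha=0.0` the positive path alone decides and the prior-driven token wins; with `alpha=1.0` the token whose score depends on the visual evidence is selected instead. In a real model the two logit vectors would come from two forward passes per step, one per path.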
Problem

Research questions and friction points this paper is trying to address.

object hallucination
vision-language models
visual grounding
linguistic priors
visual fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Positive-and-Negative Decoding
visual grounding
object hallucination
attention deficit
training-free inference
Authors
Yubo Jiang
School of Astronautics, Beihang University, Beijing 102206, China
Xin Yang
Longcat Interaction Team, Meituan, Beijing 100102, China
Abudukelimu Wuerkaixi
Longcat Interaction Team, Meituan, Beijing 100102, China
Zheming Yuan
School of Astronautics, Beihang University, Beijing 102206, China
Xuxin Cheng
University of California, San Diego
Fengying Xie
Beihang University
Zhiguo Jiang
Tianmushan Laboratory, Beihang University, Hangzhou 311115, China
Cao Liu
Longcat Interaction Team, Meituan, Beijing 100102, China
Ke Zeng
Longcat Interaction Team, Meituan, Beijing 100102, China
Haopeng Zhang
Image Processing Center, School of Astronautics, Beihang University
Computer Vision, Pattern Recognition, Image Processing