Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision-language models (LVLMs) often hallucinate semantically plausible yet image-irrelevant content due to over-reliance on linguistic priors, undermining visual faithfulness. To address this, we propose Conditional Pointwise Mutual Information (C-PMI) guided decoding: a novel framework that jointly models the contributions of visual and textual tokens to a C-PMI objective, formulated as a bi-level optimization problem, and introduces a dynamic token purification mechanism that collaboratively refines multimodal representations to strengthen cross-modal alignment. Crucially, the method requires no model fine-tuning or additional parameters, preserving the original decoding efficiency. Extensive experiments across multiple benchmarks demonstrate significant reductions in hallucination rates, validating the effectiveness and generalizability of mutual-information-driven adaptive decoding for enhancing visual fidelity.

📝 Abstract
Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods that focus solely on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens that remain maximally relevant to the given image, while simultaneously refining the image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
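To give a sense of the core idea, here is a minimal illustrative sketch of PMI-calibrated token re-scoring: each candidate token is rewarded in proportion to how much the image raises its probability over a text-only language prior. This is a generic sketch of the mutual-information intuition, not the paper's implementation (the paper's bi-level formulation additionally refines image tokens, which is omitted here); the function name, inputs, and `alpha` weight are hypothetical, and a real LVLM would supply the two sets of log-probabilities from image-conditioned and text-only forward passes.

```python
import math

def cpmi_scores(logp_with_image, logp_text_only, alpha=1.0):
    """Re-rank candidate tokens by a PMI-style calibration term.

    The conditional PMI of token y with image v given context c is
    log p(y | v, c) - log p(y | c): tokens whose probability rises when
    the image is conditioned on score higher, while tokens driven purely
    by language priors are penalized.

    logp_with_image / logp_text_only: dicts mapping candidate token ->
    log-probability under the image-conditioned and text-only passes
    (hypothetical inputs for this sketch).
    alpha: hypothetical weight on the image-dependence term.
    """
    return {
        tok: logp_with_image[tok]
        + alpha * (logp_with_image[tok] - logp_text_only[tok])
        for tok in logp_with_image
    }

# Toy example: "dog" is favored by the language prior alone,
# but "cat" is the token actually supported by the image.
with_img = {"cat": math.log(0.6), "dog": math.log(0.3)}
text_only = {"cat": math.log(0.2), "dog": math.log(0.7)}
scores = cpmi_scores(with_img, text_only)
best = max(scores, key=scores.get)  # "cat" wins after calibration
```

The calibration flips the ranking: under the image-conditioned probabilities alone "cat" already leads, but even when the language prior strongly favors a hallucinated token, the PMI term penalizes candidates whose likelihood does not depend on the image.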
Problem

Research questions and friction points this paper is trying to address.

Reducing hallucinations in LVLMs by enhancing text-image dependency
Mitigating over-reliance on language priors in LVLM decoding
Improving visual-textual token alignment to minimize irrelevant responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

C-PMI calibrated decoding reduces LVLM hallucinations
Bi-level optimization maximizes visual-textual mutual information
Token purification dynamically regulates relevant token sampling
👥 Authors
Hao Fang
Tsinghua Shenzhen International Graduate School, Tsinghua University
Changle Zhou
Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory
Jiawei Kong
Tsinghua University
Kuofeng Gao
Tsinghua University
Bin Chen
Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory
Tao Liang
ByteDance
Guojun Ma
ByteDance
Shutao Xia
Tsinghua Shenzhen International Graduate School, Tsinghua University; Pengcheng Laboratory