V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from vision-language hallucinations—generating text inconsistent with image content—primarily due to visual neglect, limiting their deployment in high-precision applications. Existing intervention methods lack principled criteria for *when* to intervene, often causing over-intervention and introducing new hallucinations. This paper proposes a lightweight, inference-time dynamic intervention framework: (1) it introduces a discriminative probe built on head-level activation patterns to precisely detect visual neglect; (2) it selectively retrieves pre-stored visual activation states *only when necessary*, applying targeted modulation via internal attention states. Crucially, the method requires no fine-tuning. Evaluated across eight benchmarks and multiple MLLMs, it significantly reduces vision-related hallucinations while preserving performance on general language tasks. The approach achieves strong effectiveness, cross-model generalizability, and computational efficiency.

📝 Abstract
Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations, producing content inconsistent with the input visuals that undermines reliability in precision-sensitive domains. This issue stems from a fundamental problem of visual neglect, where models fail to adequately prioritize input images. Existing methods typically alleviate hallucinations by intervening in the attention scores or output logits, focusing on "how to intervene" while overlooking the prerequisite question of "when to intervene". This leads to the "over-intervention" problem, which introduces new hallucinations and unnecessary computational overhead. To address this gap, we first investigate the mechanism of visual neglect and reveal that it can be accurately detected via head-level activation patterns in MLLMs. We thus propose V-ITI, a lightweight visual inference-time intervention framework integrating a Visual Neglect Detector, which identifies visual neglect via head-level discriminative probes, and a Visual Recall Intervenor, which modulates activations with prestored visual activation information only when visual neglect is detected. Extensive experiments across eight benchmarks and different MLLM families demonstrate that V-ITI consistently mitigates vision-related hallucinations while preserving general task performance.
Problem

Research questions and friction points this paper is trying to address.

Mitigates hallucinations in multimodal language models
Addresses visual neglect by detecting when to intervene
Reduces over-intervention and computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Detects visual neglect via head-level activation patterns
Intervenes only when visual neglect is detected
Modulates activations using prestored visual information
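The two-stage detect-then-intervene idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the probe weights, the prestored "visual recall" directions, the per-head linear-probe form, and the scaling factor `ALPHA` are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N_HEADS, HEAD_DIM, ALPHA = 8, 64, 0.5

# Hypothetical probe parameters: one linear classifier per attention head,
# trained offline to predict visual neglect from that head's activation.
probe_w = rng.normal(size=(N_HEADS, HEAD_DIM))
probe_b = np.zeros(N_HEADS)

# Hypothetical prestored visual activation directions, one per head
# (e.g. collected from visually grounded generations ahead of time).
visual_dir = rng.normal(size=(N_HEADS, HEAD_DIM))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def v_iti_step(head_acts, threshold=0.5):
    """Detect visual neglect per head; intervene only where detected.

    head_acts: (N_HEADS, HEAD_DIM) activations at one decoding step.
    Returns the (possibly modified) activations and the neglect mask.
    """
    # "When to intervene": per-head probe scores flag neglected heads.
    scores = sigmoid(np.einsum("hd,hd->h", probe_w, head_acts) + probe_b)
    neglect = scores > threshold
    # "How to intervene": shift only flagged heads toward the stored
    # visual direction; unflagged heads pass through unchanged.
    out = head_acts + ALPHA * neglect[:, None] * visual_dir
    return out, neglect

acts = rng.normal(size=(N_HEADS, HEAD_DIM))
new_acts, mask = v_iti_step(acts)
# Heads the probe did not flag are left exactly as they were.
assert np.allclose(new_acts[~mask], acts[~mask])
```

The key property the sketch captures is selectivity: the intervention cost and risk of over-intervention are confined to the heads whose probes fire, so a well-grounded generation step passes through untouched.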
👥 Authors
Nan Sun, University of New South Wales (Cybersecurity; Artificial Intelligence Applications)
Zhenyu Zhang, Baidu Inc., Beijing, China
Xixun Lin, Institute of Information Engineering, Chinese Academy of Sciences (Data mining; Graph representation learning; Large language model)
Kun Wang, Nanyang Technological University
Yanmin Shang, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Naibin Gu, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Shuohuan Wang, Baidu (Natural Language Processing; Deep Learning)
Yu Sun, Baidu Inc., Beijing, China
Hua Wu, Baidu Inc., Beijing, China
Haifeng Wang, Baidu Inc., Beijing, China
Yanan Cao, Institute of Information Engineering, Chinese Academy of Sciences