🤖 AI Summary
Multimodal large language models (MLLMs) suffer from object hallucination, which comprises two distinct types: omission hallucination (failing to describe objects that are present) and fabrication hallucination (describing objects that are not). Prior methods erroneously assume a shared origin, leading to trade-offs between the two. This work is the first to identify their heterogeneous causes: omission stems from insufficient confidence in visual-to-linguistic mapping, whereas fabrication arises from spurious cross-modal associations in the joint representation space. To address this, the authors propose the Visual-Semantic Attention Potential Field, a theoretical framework, and design VPFC, a plug-and-play, fine-tuning-free calibration method. VPFC achieves decoupled hallucination control via visual attention intervention, statistical bias analysis, and cross-modal association disentanglement. Experiments demonstrate that VPFC reduces omission hallucination while simultaneously suppressing fabrication hallucination, establishing an interpretable, balanced, and robust paradigm for MLLM hallucination mitigation.
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved impressive advances, yet object hallucination remains a persistent challenge. Existing methods, based on the flawed assumption that omission and fabrication hallucinations share a common cause, often reduce omissions only to trigger more fabrications. In this work, we overturn this view by demonstrating that omission hallucinations arise from insufficient confidence when mapping perceived visual features to linguistic expressions, whereas fabrication hallucinations result from spurious associations within the cross-modal representation space due to statistical biases in the training corpus. Building on findings from visual attention intervention experiments, we propose the Visual-Semantic Attention Potential Field, a conceptual framework that reveals how the model constructs visual evidence to infer the presence or absence of objects. Leveraging this insight, we introduce Visual Potential Field Calibration (VPFC), a plug-and-play hallucination mitigation method that effectively reduces omission hallucinations without introducing additional fabrication hallucinations. Our findings reveal a critical oversight in current object hallucination research and chart new directions for developing more robust and balanced hallucination mitigation strategies.
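The abstract does not spell out VPFC's algorithm, but its core idea, a fine-tuning-free intervention that strengthens attention on visual evidence at decode time without creating new cross-modal associations, can be sketched roughly. The function name, the uniform boost factor, and the renormalization rule below are illustrative assumptions, not the paper's actual method:

```python
def calibrate_attention(weights, visual_idx, boost=1.5):
    """Hypothetical sketch: upweight attention mass on visual tokens, then
    renormalize. NOT the paper's VPFC algorithm; the scaling rule is an
    illustrative assumption.

    weights:    non-negative attention weights for one query token.
    visual_idx: positions that correspond to image tokens.
    boost:      >1 strengthens existing visual evidence (targeting omission)
                without introducing new token associations (so it should not,
                by itself, create fabrication).
    """
    scaled = [w * boost if i in visual_idx else w
              for i, w in enumerate(weights)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Toy attention row over 4 tokens; positions 0 and 1 are visual tokens.
attn = [0.1, 0.2, 0.3, 0.4]
calibrated = calibrate_attention(attn, visual_idx={0, 1})
```

Because the intervention only rescales attention the model already assigns to visual tokens, it amplifies perceived evidence rather than injecting priors, which is one plausible way a method could reduce omissions without trading them for fabrications.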