From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from object hallucination—generating objects absent in the input image—thereby undermining factual consistency and reliability. This work identifies insufficient visual feature disentanglement during modality alignment—not deficient visual encoder representation—as the primary cause. To address this, we propose PATCH tuning: a plug-and-play, architecture-agnostic fine-tuning strategy that introduces bounding-box-guided adaptive virtual tokens for fine-grained, spatially localizable visual feature extraction; integrates modular feature disentanglement alignment; and performs end-to-end multimodal fine-tuning. Crucially, PATCH requires no modification to the backbone architecture. Evaluated across multiple multimodal hallucination benchmarks, PATCH achieves state-of-the-art performance, significantly reducing out-of-image object generation while improving factual consistency and model reliability.
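
No reference code accompanies this page, so the sketch below is a minimal, hypothetical PyTorch rendering of the bounding-box-guided virtual tokens the summary describes. The module name `VirtualTokenAdapter`, the 2x2 pooling grid, and all dimensions are illustrative assumptions rather than the authors' implementation; only `torchvision.ops.roi_align` is a real library call.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class VirtualTokenAdapter(nn.Module):
    """Pools each bounding box from the visual feature map and projects the
    result into the LLM embedding space as a handful of 'virtual tokens'.
    (Hypothetical sketch; names and shapes are assumptions.)"""
    def __init__(self, vis_dim: int, llm_dim: int, grid: int = 2):
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(vis_dim, llm_dim)  # visual channels -> LLM width

    def forward(self, feat_map, boxes):
        # feat_map: (B, C, H, W); boxes: one (N_i, 4) xyxy tensor per image,
        # already scaled to feature-map coordinates.
        pooled = roi_align(feat_map, boxes, output_size=self.grid)  # (sum N_i, C, g, g)
        tokens = pooled.flatten(2).transpose(1, 2)                  # (sum N_i, g*g, C)
        return self.proj(tokens)                                    # (sum N_i, g*g, llm_dim)

# Toy usage: one image, one detected object box.
feat = torch.randn(1, 1024, 24, 24)               # assumed visual encoder output
boxes = [torch.tensor([[2.0, 3.0, 10.0, 12.0]])]  # xyxy in feature-map coords
adapter = VirtualTokenAdapter(vis_dim=1024, llm_dim=4096)
virtual_tokens = adapter(feat, boxes)             # -> (1, 4, 4096)
```

The design point this illustrates is that the adapter sits between the visual encoder and the LLM, so the backbone weights on both sides can stay untouched.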

📝 Abstract
Hallucinations in large vision-language models (LVLMs), i.e., generating objects that are not present in the visual input, are a significant challenge that impairs their reliability. Recent studies often attribute hallucinations to a lack of understanding of the visual input, yet ignore a more fundamental issue: the model's inability to effectively extract or decouple visual features. In this paper, we revisit hallucinations in LVLMs from an architectural perspective, investigating whether the primary cause lies in the visual encoder (feature extraction) or the modality alignment module (feature decoupling). Motivated by the findings of our preliminary investigation, we propose a novel tuning strategy, PATCH, to mitigate hallucinations in LVLMs. This plug-and-play method can be integrated into various LVLMs, using adaptive virtual tokens to extract object features from bounding boxes, thereby addressing hallucinations caused by insufficient decoupling of visual features. PATCH achieves state-of-the-art performance on multiple multimodal hallucination datasets. We hope this approach provides researchers with deeper insights into the underlying causes of hallucinations in LVLMs, fostering further advancement and innovation in the field.
Problem

Research questions and friction points this paper is trying to address.

Addresses object hallucinations in vision-language models
Investigates visual feature extraction and decoupling issues
Proposes a plug-and-play method to mitigate false object generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play tuning with adaptive virtual tokens (see the integration sketch after this list)
Extracts object features from bounding boxes
Addresses insufficient decoupling of visual features
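
Continuing the adapter sketch above, here is one hedged way the "plug-and-play, no backbone modification" claim could look in code: the virtual tokens are simply prepended to the text embeddings, so the LLM itself is never edited and only the adapter trains. The embedding table, vocabulary size, and sequence lengths are stand-ins, not the paper's configuration.

```python
import torch
import torch.nn as nn

llm_dim = 4096
embed = nn.Embedding(32000, llm_dim)           # stand-in for the frozen LLM's embedding table
input_ids = torch.randint(0, 32000, (1, 16))   # a tokenized prompt

text_emb = embed(input_ids)                    # (1, 16, llm_dim)
virtual_tokens = torch.randn(1, 4, llm_dim)    # adapter output from the sketch above
inputs_embeds = torch.cat([virtual_tokens, text_emb], dim=1)  # prepend object tokens
# The concatenated sequence then goes through the unmodified LLM, e.g.
# llm(inputs_embeds=inputs_embeds); only the adapter's parameters receive gradients.
```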
👥 Authors
Yuying Shang
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China
Xinyi Zeng
Sichuan University
Medical Image Segmentation · Medical Image Reconstruction · Multi-modal Learning
Yutao Zhu
Gaoling School of Artificial Intelligence, Renmin University of China
Zhengwei Fang
Beijing Jiaotong University
Adversarial Robustness · Vision Language Model · Computer Vision · Uncertainty
Jingyuan Zhang
Kuaishou Technology Inc., Beijing, China
Jiawei Chen
Shanghai Key Laboratory of Multi. Info. Processing, East China Normal University
Xiao Yang
Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua University, Beijing, China
Zinan Liu
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China
Yu Tian
Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua University, Beijing, China