Hallucination of Multimodal Large Language Models: A Survey

📅 2024-04-29
🏛️ arXiv.org
📈 Citations: 113 (Influential: 3)
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from pervasive hallucinations, outputs misaligned with the visual input, which severely undermine their reliability and real-world applicability. This survey systematically investigates the root causes of such hallucinations and introduces a fine-grained taxonomy of them. It reviews mainstream benchmarks, including POPE and MME, together with the quantitative metrics used for cross-modal alignment diagnostics and empirical evaluation. It further synthesizes and categorizes mitigation strategies along three dimensions (prompt engineering, parameter-efficient fine-tuning, and decoding control), constructing a comprehensive methodological map. Key contributions: (1) a unified analytical framework for MLLM hallucination; (2) an open-source resource repository, Awesome-MLLM-Hallucination; and (3) a clear articulation of open challenges and future research directions, providing both theoretical foundations and practical guidelines for enhancing MLLM robustness.
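Benchmarks such as POPE probe object hallucination with yes/no questions ("Is there a <object> in the image?") and score the model's answers against ground-truth labels. A minimal sketch of how such probes are typically scored follows; the data and function name are illustrative, not the survey's own code.

```python
# Sketch of POPE-style object-hallucination scoring (hypothetical data).
# Each probe has a ground-truth "yes"/"no" label and the model's "yes"/"no" answer.

def pope_metrics(labels, answers):
    """Compute accuracy, precision, recall, F1, and yes-ratio for binary probes."""
    assert len(labels) == len(answers) and labels
    tp = sum(1 for l, a in zip(labels, answers) if l == "yes" and a == "yes")
    fp = sum(1 for l, a in zip(labels, answers) if l == "no" and a == "yes")
    fn = sum(1 for l, a in zip(labels, answers) if l == "yes" and a == "no")
    tn = sum(1 for l, a in zip(labels, answers) if l == "no" and a == "no")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # A high yes-ratio signals an over-affirmative model that tends to
        # "confirm" objects that are not in the image (object hallucination).
        "yes_ratio": (tp + fp) / len(labels),
    }

# Example: 4 probes; the model wrongly answers "yes" once (a hallucinated object).
labels  = ["yes", "no", "no", "yes"]
answers = ["yes", "yes", "no", "yes"]
print(pope_metrics(labels, answers))
# → accuracy 0.75, precision ≈ 0.667, recall 1.0, F1 0.8, yes-ratio 0.75
```

The yes-ratio is reported alongside accuracy because MLLMs often exhibit a "yes" bias; a model can score high recall simply by affirming everything, which the yes-ratio exposes.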

📝 Abstract
This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs), which have demonstrated significant advancements and remarkable abilities in multimodal tasks. Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content, a challenge known as hallucination, which poses substantial obstacles to their practical deployment and raises concerns regarding their reliability in real-world applications. This problem has attracted increasing attention, prompting efforts to detect and mitigate such inaccuracies. We review recent advances in identifying, evaluating, and mitigating these hallucinations, offering a detailed overview of the underlying causes, evaluation benchmarks, metrics, and strategies developed to address this issue. Additionally, we analyze the current challenges and limitations, formulating open questions that delineate potential pathways for future research. By charting a granular classification and the landscape of hallucination causes, evaluation benchmarks, and mitigation methods, this survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field. Through our thorough and in-depth review, we contribute to the ongoing dialogue on enhancing the robustness and reliability of MLLMs, providing valuable insights and resources for researchers and practitioners alike. Resources are available at: https://github.com/showlab/Awesome-MLLM-Hallucination.
Problem

Research questions and friction points this paper is trying to address.

Analyzing hallucination in multimodal large language models (MLLMs).
Detecting and mitigating inconsistent outputs with visual content.
Reviewing causes, benchmarks, and solutions for MLLM hallucinations.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Survey analyzes MLLM hallucination causes
Reviews benchmarks for hallucination evaluation
Categorizes strategies to mitigate MLLM hallucinations
Zechen Bai, National University of Singapore (Multimodal, Computer Vision, Virtual Reality)
Pichao Wang, Amazon Prime Video, USA
Tianjun Xiao, Tesla Autopilot (Computer Vision, Multimedia, Machine Learning)
Tong He, AWS Shanghai AI Lab, China
Zongbo Han, Assistant Professor, BUPT; TJU (Machine Learning)
Zheng Zhang, AWS Shanghai AI Lab, China
Mike Zheng Shou, Show Lab, National University of Singapore, Singapore