ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models

📅 2024-09-14

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Problem: Existing static benchmarks for hallucination evaluation in multimodal large language models (MLLMs) are exposed to data contamination and generalize poorly. Method: The paper proposes an open-set, dynamic, object-level hallucination evaluation paradigm, which it presents as the first of its kind. Real-world concepts are modeled as a graph with distributional relationships, and graph representation learning drives distribution-based sampling of concept combinations. A dual-path evaluation framework, with a generative and a discriminative component, detects hallucinations at both the existence level and the attribute level. Contribution/Results: Mainstream MLLMs show higher hallucination rates on the generated samples than on static benchmarks, which is consistent with the data contamination hypothesis. The generated test samples are interpretable, expose systematic hallucination patterns, and can be used for targeted fine-tuning. The protocol applies to both general-purpose and low-resource settings.

📝 Abstract

Hallucination poses a persistent challenge for multimodal large language models (MLLMs). However, existing benchmarks for evaluating hallucinations are generally static, which may overlook the potential risk of data contamination. To address this issue, we propose ODE, an open-set, dynamic protocol designed to evaluate object hallucinations in MLLMs at both the existence and attribute levels. ODE employs a graph-based structure to represent real-world object concepts, their attributes, and the distributional associations between them. This structure facilitates the extraction of concept combinations based on diverse distributional criteria, generating varied samples for structured queries that evaluate hallucinations in both generative and discriminative tasks. Through the generation of new samples, dynamic concept combinations, and varied distribution frequencies, ODE mitigates the risk of data contamination and broadens the scope of evaluation. This protocol is applicable to both general and specialized scenarios, including those with limited data. Experimental results demonstrate the effectiveness of our protocol, revealing that MLLMs exhibit higher hallucination rates when evaluated with ODE-generated samples, which indicates potential data contamination. Furthermore, these generated samples aid in analyzing hallucination patterns and fine-tuning models, offering an effective approach to mitigating hallucinations in MLLMs.
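
The graph-based sampling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy annotations in `IMAGE_OBJECTS`, the three criteria names (`frequent`, `rare`, `unseen`), and the query templates are all assumptions made for illustration. The sketch builds a co-occurrence graph over object concepts, picks concept pairs under a chosen distributional criterion, and turns a pair into generative and discriminative queries.

```python
import random
from collections import Counter
from itertools import combinations

# Hypothetical toy annotations: each entry is the set of objects in one image.
IMAGE_OBJECTS = [
    {"dog", "frisbee", "person"},
    {"dog", "person"},
    {"dog", "frisbee"},
    {"person", "surfboard"},
    {"cat", "sofa"},
]


def build_cooccurrence_graph(image_objects):
    """Map each unordered object pair to the number of images containing both."""
    edges = Counter()
    for objects in image_objects:
        for pair in combinations(sorted(objects), 2):
            edges[pair] += 1
    return edges


def sample_pairs(edges, nodes, criterion, k, rng):
    """Pick up to k concept pairs under one illustrative distribution criterion."""
    if criterion == "frequent":
        # Highest co-occurrence first; ties broken alphabetically.
        ranked = sorted(edges.items(), key=lambda kv: (-kv[1], kv[0]))
        return [pair for pair, _ in ranked[:k]]
    if criterion == "rare":
        # Lowest non-zero co-occurrence first.
        ranked = sorted(edges.items(), key=lambda kv: (kv[1], kv[0]))
        return [pair for pair, _ in ranked[:k]]
    if criterion == "unseen":
        # Pairs that never co-occur in the annotations.
        candidates = [p for p in combinations(sorted(nodes), 2) if p not in edges]
        return rng.sample(candidates, min(k, len(candidates)))
    raise ValueError(f"unknown criterion: {criterion}")


def to_queries(pair):
    """Turn a concept pair into one generative and two discriminative queries."""
    generative = "Describe the objects in this image."
    discriminative = [f"Is there a {obj} in the image?" for obj in pair]
    return generative, discriminative


edges = build_cooccurrence_graph(IMAGE_OBJECTS)
nodes = set().union(*IMAGE_OBJECTS)
rng = random.Random(0)  # fixed seed so the sketch is reproducible

frequent_pairs = sample_pairs(edges, nodes, "frequent", 2, rng)
unseen_pairs = sample_pairs(edges, nodes, "unseen", 3, rng)
print("frequent:", frequent_pairs)
print("unseen:", unseen_pairs)
print(to_queries(frequent_pairs[0]))
```

Because the pairs are recombined at evaluation time rather than fixed in advance, a test set built this way is harder to memorize than a static one, which is the intuition behind the contamination argument above.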

Problem

Research questions and friction points this paper is trying to address.

Evaluating object hallucinations in multimodal large language models dynamically

Mitigating data contamination risks in hallucination benchmarks for MLLMs

Analyzing and reducing hallucination patterns in generative and discriminative tasks
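
As a rough illustration of how hallucination could be scored in the two task types, the sketch below computes yes/no metrics for discriminative queries and a simple object-level hallucination ratio for generated descriptions. The function names, the answer format, and the choice of metrics are assumptions for illustration; the summary above does not specify the exact metrics the paper reports.

```python
def discriminative_metrics(answers, labels):
    """Score yes/no answers; 'yes' means the model claims the object is present."""
    pairs = list(zip(answers, labels))
    tp = sum(a == "yes" and l == "yes" for a, l in pairs)
    fp = sum(a == "yes" and l == "no" for a, l in pairs)
    tn = sum(a == "no" and l == "no" for a, l in pairs)
    fn = sum(a == "no" and l == "yes" for a, l in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Share of absent objects the model wrongly affirms.
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


def generative_hallucination_rate(mentioned, ground_truth):
    """Fraction of mentioned objects that are not actually in the image."""
    mentioned = list(mentioned)
    if not mentioned:
        return 0.0
    hallucinated = [obj for obj in mentioned if obj not in ground_truth]
    return len(hallucinated) / len(mentioned)


# Hypothetical model outputs for four discriminative queries.
scores = discriminative_metrics(
    answers=["yes", "yes", "no", "yes"],
    labels=["yes", "no", "no", "yes"],
)
rate = generative_hallucination_rate(
    mentioned=["dog", "frisbee", "ball"],
    ground_truth={"dog", "frisbee"},
)
print(scores)
print(rate)
```

Comparing these numbers between a static benchmark and freshly sampled queries is one way to look for the gap the paper attributes to contamination.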

Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-set dynamic protocol for MLLM hallucination evaluation

Graph-based structure for object concept representation

Dynamic concept combinations mitigate data contamination

🔎 Similar Papers

Hallucination of Multimodal Large Language Models: A Survey

2024-04-29, arXiv.org, Citations: 113