🤖 AI Summary
Existing static benchmarks for hallucination evaluation in multimodal large language models (MLLMs) are vulnerable to data contamination and cover a limited range of test conditions. Method: We propose ODE, an open-set, dynamic protocol for evaluating object hallucinations. It represents real-world object concepts, their attributes, and the distributional associations between them as a graph, and samples concept combinations from that graph under different distributional criteria. The sampled combinations drive structured queries in both generative and discriminative tasks, so that hallucinations can be detected at both the existence level and the attribute level. Contribution/Results: Mainstream MLLMs show higher hallucination rates on ODE-generated samples, which points to potential data contamination in existing benchmarks. The generated samples also help in analyzing hallucination patterns and in fine-tuning models to mitigate hallucinations. The protocol applies to both general and specialized scenarios, including those with limited data.
📝 Abstract
Hallucination poses a persistent challenge for multimodal large language models (MLLMs). However, existing benchmarks for evaluating hallucinations are generally static, which may overlook the potential risk of data contamination. To address this issue, we propose ODE, an open-set, dynamic protocol designed to evaluate object hallucinations in MLLMs at both the existence and attribute levels. ODE employs a graph-based structure to represent real-world object concepts, their attributes, and the distributional associations between them. This structure facilitates the extraction of concept combinations based on diverse distributional criteria, generating varied samples for structured queries that evaluate hallucinations in both generative and discriminative tasks. Through the generation of new samples, dynamic concept combinations, and varied distribution frequencies, ODE mitigates the risk of data contamination and broadens the scope of evaluation. This protocol is applicable to both general and specialized scenarios, including those with limited data. Experimental results demonstrate the effectiveness of our protocol, revealing that MLLMs exhibit higher hallucination rates when evaluated with ODE-generated samples, which indicates potential data contamination. Furthermore, these generated samples aid in analyzing hallucination patterns and fine-tuning models, offering an effective approach to mitigating hallucinations in MLLMs.
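The sampling idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the toy annotations, the function names (`build_cooccurrence_graph`, `sample_pairs`, `build_queries`), the four criterion names, and the query wording are all hypothetical. The sketch builds a co-occurrence graph over object concepts, selects object pairs under a chosen distributional criterion, and turns each pair into generative and discriminative prompts. It omits attributes and the image-generation step that a full pipeline would need.

```python
import random
from collections import Counter
from itertools import combinations

# Hypothetical toy annotations: each entry lists the objects seen in one image.
ANNOTATIONS = [
    ["dog", "frisbee", "grass"],
    ["dog", "frisbee"],
    ["dog", "sofa"],
    ["cat", "sofa"],
    ["cat", "sofa", "lamp"],
    ["boat", "lamp"],
]


def build_cooccurrence_graph(annotations):
    """Return (nodes, edges); edges maps a sorted object pair to its co-occurrence count."""
    nodes = set()
    edges = Counter()
    for objects in annotations:
        unique = sorted(set(objects))
        nodes.update(unique)
        edges.update(combinations(unique, 2))
    return sorted(nodes), edges


def sample_pairs(nodes, edges, criterion, k, rng):
    """Pick up to k object pairs under one illustrative distributional criterion."""
    all_pairs = list(combinations(nodes, 2))
    if criterion == "frequent":
        # Highest co-occurrence first; ties broken alphabetically.
        return sorted(all_pairs, key=lambda p: (-edges[p], p))[:k]
    if criterion == "rare":
        # Lowest non-zero co-occurrence first.
        seen = [p for p in all_pairs if edges[p] > 0]
        return sorted(seen, key=lambda p: (edges[p], p))[:k]
    if criterion == "unseen":
        # Pairs that never co-occur in the annotations.
        unseen = [p for p in all_pairs if edges[p] == 0]
        return rng.sample(unseen, min(k, len(unseen)))
    if criterion == "random":
        return rng.sample(all_pairs, min(k, len(all_pairs)))
    raise ValueError(f"unknown criterion: {criterion}")


def build_queries(pair):
    """Turn one object pair into a generative prompt and discriminative yes/no prompts."""
    return {
        "generative": "Describe the objects in this image.",
        "discriminative": [f"Is there a {obj} in the image?" for obj in pair],
    }


if __name__ == "__main__":
    rng = random.Random(0)
    nodes, edges = build_cooccurrence_graph(ANNOTATIONS)
    for criterion in ("frequent", "rare", "unseen", "random"):
        for pair in sample_pairs(nodes, edges, criterion, 2, rng):
            print(criterion, pair, build_queries(pair)["discriminative"])
```

Because pairs are re-sampled from the graph on every run rather than fixed in advance, the resulting test set is not a static artifact that could leak into training data, which is the property the protocol relies on to reduce contamination risk.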