🤖 AI Summary
Existing benchmarks for evaluating hallucinations in multimodal large language models lack fine-grained annotations and domain diversity, making it difficult to assess how accurately models localize hallucinations within long image descriptions. To address this limitation, this work introduces DetailVerifyBench, a challenging cross-domain benchmark designed specifically for hallucination evaluation in extended textual descriptions. It comprises 1,000 images spanning five distinct domains, with each description averaging more than 200 words. Notably, DetailVerifyBench provides the first human-annotated, token-level labels covering multiple types of hallucinations. By improving both the precision and breadth of hallucination detection evaluation, it stands as the most demanding dataset currently available for localized hallucination assessment in long-form multimodal generation.
📝 Abstract
Accurately detecting and localizing hallucinations is critical for ensuring the reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often spanning hundreds of words. This shift dramatically increases the difficulty of the task: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flag response-level inconsistencies. However, existing benchmarks lack the fine granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it stands as the most challenging benchmark to date for precise hallucination localization in long image captioning. Our benchmark is available at https://zyx-hhnkh.github.io/DetailVerifyBench/.
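To make "token-level annotations of multiple hallucination types" concrete, the minimal sketch below models one benchmark example as a long caption plus a list of labeled spans, and expands those spans into a per-token label sequence. All field names (`image_id`, `start_token`, `hallucination_type`, etc.) and the illustrative type labels are hypothetical; the page above does not specify the released data schema or file format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record layout for one DetailVerifyBench example;
# the actual release may use different fields and formats.

@dataclass
class HallucinationSpan:
    start_token: int         # index of the first hallucinated token in the caption
    end_token: int           # index one past the last hallucinated token
    hallucination_type: str  # e.g. "object", "attribute", "relation" (illustrative)

@dataclass
class BenchmarkExample:
    image_id: str  # identifier for one of the 1,000 images
    domain: str    # one of the five domains
    caption: str   # long-form description, >200 words on average
    spans: List[HallucinationSpan] = field(default_factory=list)

def token_labels(example: BenchmarkExample) -> List[str]:
    """Expand span annotations into per-token labels ("O" = not hallucinated)."""
    tokens = example.caption.split()
    labels = ["O"] * len(tokens)
    for span in example.spans:
        for i in range(span.start_token, min(span.end_token, len(tokens))):
            labels[i] = span.hallucination_type
    return labels
```

Under this assumed layout, evaluating a detector reduces to comparing its predicted per-token (or per-span) labels against `token_labels(example)`, e.g. with token-level precision and recall per hallucination type.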