🤖 AI Summary
This work addresses fine-grained hallucinations in video captioning by introducing, for the first time, the **segment-level hallucination localization task**, which identifies erroneous words or phrases within captions, moving beyond coarse sentence-level hallucination detection. To support this task, we construct HLVC-Dataset, the first finely annotated benchmark, comprising 1,167 video–caption pairs; captions are initially generated by VideoLLMs, then verified and annotated for hallucinations by human experts at the word/phrase level. We further design a tailored VideoLLM-based baseline model and conduct comprehensive quantitative and qualitative evaluations. Experiments demonstrate that our approach effectively localizes hallucinated segments, significantly improving error traceability and diagnostic precision. This work establishes a novel task paradigm, provides the first dedicated benchmark dataset, and lays methodological foundations for fine-grained analysis and evaluation of multimodal hallucinations in video understanding.
📝 Abstract
We propose a novel task, hallucination localization in video captioning, which aims to identify hallucinations in video captions at the span level (i.e., individual words or phrases). This allows for a more detailed analysis of hallucinations than the existing sentence-level hallucination detection task. To establish a benchmark for hallucination localization, we construct HLVC-Dataset, a carefully curated dataset created by manually annotating 1,167 video–caption pairs from VideoLLM-generated captions. We further implement a VideoLLM-based baseline method and conduct quantitative and qualitative evaluations to benchmark current performance on hallucination localization.
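The abstract does not specify how span-level predictions are scored. As a minimal sketch, assuming gold annotations and model predictions are both given as character-offset `(start, end)` spans over a caption, localization quality could be measured with character-level precision, recall, and F1 (the function and span format here are illustrative assumptions, not the paper's actual metric):

```python
# Hypothetical scoring sketch for span-level hallucination localization.
# Assumption: hallucinated regions are (start, end) character-offset spans,
# end-exclusive; this is NOT necessarily the metric used in the paper.

def covered_positions(spans):
    """Expand (start, end) spans into the set of covered character positions."""
    positions = set()
    for start, end in spans:
        positions.update(range(start, end))
    return positions

def localization_f1(pred_spans, gold_spans):
    """Character-level F1 between predicted and gold hallucinated spans."""
    pred = covered_positions(pred_spans)
    gold = covered_positions(gold_spans)
    if not pred and not gold:
        return 1.0  # no hallucination predicted, none annotated
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: gold span covers offsets [10, 20); prediction covers half of it.
print(localization_f1([(10, 15)], [(10, 20)]))  # → 0.666...
```

A character-level (rather than exact-span-match) score gives partial credit for overlapping but imperfectly bounded predictions, which matters when hallucination boundaries are themselves ambiguous for annotators.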