Hallucination Localization in Video Captioning

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses fine-grained hallucinations in video captioning by introducing, for the first time, the **segment-level hallucination localization task**, which identifies erroneous words or phrases within captions—moving beyond coarse sentence-level hallucination detection. To support this task, we construct HLVC-Dataset, the first finely annotated benchmark comprising 1,167 video–caption pairs; captions are initially generated by VideoLLMs and rigorously verified and hallucination-annotated by human experts at the token/phrase level. We further design a tailored VideoLLM-based baseline model and conduct comprehensive quantitative and qualitative evaluations. Experiments demonstrate that our approach effectively localizes hallucinated segments, significantly improving error traceability and diagnostic precision. This work establishes a novel task paradigm, provides the first dedicated benchmark dataset, and lays methodological foundations for fine-grained analysis and evaluation of multimodal hallucinations in video understanding.

📝 Abstract
We propose a novel task, hallucination localization in video captioning, which aims to identify hallucinations in video captions at the span level (i.e., individual words or phrases). This allows for a more detailed analysis of hallucinations than the existing sentence-level hallucination detection task. To establish a benchmark for hallucination localization, we construct HLVC-Dataset, a carefully curated dataset created by manually annotating 1,167 video-caption pairs drawn from VideoLLM-generated captions. We further implement a VideoLLM-based baseline method and conduct quantitative and qualitative evaluations to benchmark current performance on hallucination localization.
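The abstract defines hallucination localization as marking individual words or phrases inside a caption rather than flagging whole sentences. As an illustration only (the paper's actual annotation scheme and evaluation metric are not detailed on this page), such spans can be represented as character offsets into the caption and a prediction scored against gold annotations with a simple overlap F1; the caption, span values, and function names below are hypothetical:

```python
def spans_to_char_set(spans):
    """Expand (start, end) character spans into the set of covered indices."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered

def span_f1(gold_spans, pred_spans):
    """Character-level F1 between gold and predicted hallucination spans."""
    gold = spans_to_char_set(gold_spans)
    pred = spans_to_char_set(pred_spans)
    if not gold and not pred:
        return 1.0  # both agree the caption is hallucination-free
    overlap = len(gold & pred)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A VideoLLM-generated caption with annotated hallucination spans:
caption = "A man in a red shirt plays guitar on a beach."
gold = [(11, 20)]            # caption[11:20] == "red shirt"
pred = [(11, 20), (34, 44)]  # model also (wrongly) flags "on a beach"
score = span_f1(gold, pred)  # partial credit: span found, one false alarm
```

Span-level scoring of this kind gives partial credit for overlapping but inexact spans, which is what makes the task's evaluation finer-grained than binary sentence-level detection.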
Problem

Research questions and friction points this paper is trying to address.

Identifying hallucinations in video captions at span level
Creating a dataset to benchmark hallucination localization
Evaluating baseline methods for video caption hallucination detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces span-level hallucination localization in video captioning
Constructs HLVC-Dataset with annotated video-caption pairs
Implements VideoLLM-based baseline method for evaluation
Authors
Shota Nakada, LY Corporation
Kazuhiro Saito, LY Corporation
Yuchi Ishikawa, LY Corporation
Hokuto Munakata, LY Corporation (audio signal processing, sound source separation, multimodal AI)
Tatsuya Komatsu, LINE Corporation (signal processing, sound event detection, source separation)
Masayoshi Kondo, LY Corporation