🤖 AI Summary
This work addresses the susceptibility of current video large language models (Video-LLMs) to hallucinations in both faithfulness and factuality, a problem exacerbated by the absence of systematic benchmarks that evaluate robustness under perturbations. The authors introduce INFACT, a diagnostic benchmark comprising 9,800 question-answer pairs derived from both real and synthetic videos, evaluated under four protocols: Base (clean), Visual Degradation, Evidence Corruption, and Temporal Intervention. They propose a fine-grained hallucination taxonomy and hallucination-induction mechanisms designed to expose model fragility under non-ideal conditions. Using a multidimensional framework incorporating Resist Rate (RR) and Temporal Sensitivity Score (TSS), evaluations across 14 prominent Video-LLMs demonstrate that high Base-mode accuracy does not ensure robustness to perturbations: evidence corruption severely undermines stability, temporal intervention causes the largest performance degradation, and most open-source models exhibit near-zero TSS on factual temporal tasks, indicating pronounced temporal inertia on order-sensitive queries.
📝 Abstract
Despite rapid progress, Video Large Language Models (Video-LLMs) remain unreliable due to hallucinations: outputs that contradict either the video evidence (faithfulness) or verifiable world knowledge (factuality). Existing benchmarks provide limited coverage of factuality hallucinations and predominantly evaluate models only in clean settings. We introduce INFACT, a diagnostic benchmark comprising 9,800 QA instances with fine-grained taxonomies for faithfulness and factuality, spanning real and synthetic videos. INFACT evaluates models in four modes: Base (clean), Visual Degradation, Evidence Corruption, and Temporal Intervention for order-sensitive items. Reliability under the induced modes is quantified using Resist Rate (RR) and Temporal Sensitivity Score (TSS). Experiments on 14 representative Video-LLMs reveal that higher Base-mode accuracy does not consistently translate into reliability under the induced modes, with evidence corruption reducing stability and temporal intervention yielding the largest degradation. Notably, many open-source baselines exhibit near-zero TSS on factuality, indicating pronounced temporal inertia on order-sensitive questions.
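Neither RR nor TSS is formally defined in the abstract. The sketch below illustrates one plausible reading, assuming RR is the fraction of Base-mode-correct answers that stay correct under an induced mode, and TSS is the fraction of order-sensitive items a model answers correctly both before and after a temporal intervention (e.g., reversing clip order) that changes the ground truth. All names and formulas here are illustrative assumptions, not the paper's exact definitions.

```python
# Hypothetical sketch of the two reliability metrics; the exact
# formulas in the paper may differ.

def resist_rate(base_correct: list[bool], induced_correct: list[bool]) -> float:
    """Fraction of Base-mode-correct answers that remain correct
    under a perturbed evaluation mode (assumed definition)."""
    survived = [ind for base, ind in zip(base_correct, induced_correct) if base]
    return sum(survived) / len(survived) if survived else 0.0

def temporal_sensitivity(orig_pred: list[str], flipped_pred: list[str],
                         orig_gold: list[str], flipped_gold: list[str]) -> float:
    """Fraction of order-sensitive items answered correctly both before and
    after a temporal intervention that changes the gold answer (assumed)."""
    if not orig_pred:
        return 0.0
    hits = sum((p == g) and (fp == fg)
               for p, fp, g, fg in zip(orig_pred, flipped_pred,
                                       orig_gold, flipped_gold))
    return hits / len(orig_pred)
```

Under this reading, a model with pronounced temporal inertia keeps its original answer after the intervention, so its TSS stays near zero even when its Base-mode accuracy is high.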