🤖 AI Summary
Large video models (LVMs) frequently hallucinate, generating outputs that are inconsistent with the underlying video content. To address this, the authors propose Dr.V, a hierarchical diagnostic framework spanning the perceptive, temporal, and cognitive levels, which detects hallucinations interpretably through fine-grained spatial-temporal grounding. The work introduces Dr.V-Bench, a benchmark of 10k annotated instances drawn from 4,974 videos, and Dr.V-Agent, an agent-based system that applies spatial-temporal grounding at the perceptive and temporal levels and then cognitive-level reasoning, mirroring human-like video comprehension. Experiments indicate that Dr.V-Agent is effective at diagnosing hallucinations while improving interpretability and reliability. All code and data are publicly released.
📝 Abstract
Recent advancements in large video models (LVMs) have significantly enhanced video understanding. However, these models continue to suffer from hallucinations, producing content that conflicts with the input videos. To address this issue, we propose Dr.V, a hierarchical framework covering the perceptive, temporal, and cognitive levels that diagnoses video hallucination through fine-grained spatial-temporal grounding. Dr.V comprises two key components: a benchmark dataset, Dr.V-Bench, and a satellite video agent, Dr.V-Agent. Dr.V-Bench includes 10k instances drawn from 4,974 videos spanning diverse tasks, each enriched with detailed spatial-temporal annotations. Dr.V-Agent detects hallucinations in LVMs by systematically applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive-level reasoning. This step-by-step pipeline mirrors human-like video comprehension and effectively identifies hallucinations. Extensive experiments demonstrate that Dr.V-Agent is effective in diagnosing hallucination while enhancing interpretability and reliability, offering a practical blueprint for robust video understanding in real-world scenarios. All our data and code are available at https://github.com/Eurekaleo/Dr.V.
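To make the perceptive, temporal, then cognitive ordering concrete, the following is a minimal illustrative sketch and not the released Dr.V-Agent code. The `Grounding` and `Claim` types, the `diagnose` function, the IoU thresholds, and the boolean `consistent_with_context` flag (a stand-in for cognitive-level reasoning) are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Grounding:
    """Hypothetical annotation: where and when an object appears."""
    label: str
    box: tuple   # (x1, y1, x2, y2) in pixels
    span: tuple  # (start_s, end_s) in seconds


@dataclass
class Claim:
    """Hypothetical structured claim extracted from an LVM answer."""
    label: str
    box: tuple
    span: tuple
    consistent_with_context: bool  # placeholder for cognitive-level reasoning


def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def span_iou(a, b):
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def diagnose(claim, annotations, box_thr=0.5, span_thr=0.5) -> Optional[str]:
    """Return the first level at which the claim fails, or None if it passes."""
    matches = [g for g in annotations if g.label == claim.label]
    # Perceptive level: the claimed object must exist roughly where stated.
    if not matches or max(box_iou(claim.box, g.box) for g in matches) < box_thr:
        return "perceptive"
    # Temporal level: the claimed time span must overlap an annotated one.
    if max(span_iou(claim.span, g.span) for g in matches) < span_thr:
        return "temporal"
    # Cognitive level: reached only after both grounding checks pass.
    if not claim.consistent_with_context:
        return "cognitive"
    return None
```

Checking the levels in this order means a claim is only reasoned about at a higher level once its lower-level grounding holds, which is one way to read the "step-by-step pipeline" in the abstract.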