Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding

📅 2025-09-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large video models (LVMs) suffer from pervasive hallucination—generating outputs inconsistent with the underlying video content. To address this, we propose Dr.V, the first video hallucination diagnostic framework spanning perceptual, temporal, and cognitive levels, enabling interpretable, hierarchical hallucination detection via fine-grained spatiotemporal localization. We introduce Dr.V-Bench, a benchmark comprising 10,000 annotated video samples, and design Dr.V-Agent, an agent-based system integrating spatial grounding, temporal consistency modeling, and high-level semantic reasoning to emulate human video understanding. Extensive experiments demonstrate that Dr.V significantly improves hallucination detection accuracy and model robustness while enhancing decision interpretability. The framework advances principled evaluation of LVM reliability and supports trustworthy deployment. All code and data are publicly released, establishing a new paradigm for evaluating and mitigating hallucinations in video foundation models.

📝 Abstract
Recent advancements in large video models (LVMs) have significantly enhanced video understanding. However, these models continue to suffer from hallucinations, producing content that conflicts with the input videos. To address this issue, we propose Dr.V, a hierarchical framework covering the perceptive, temporal, and cognitive levels to diagnose video hallucination by fine-grained spatial-temporal grounding. Dr.V comprises two key components: a benchmark dataset, Dr.V-Bench, and a satellite video agent, Dr.V-Agent. Dr.V-Bench includes 10k instances drawn from 4,974 videos spanning diverse tasks, each enriched with detailed spatial-temporal annotations. Dr.V-Agent detects hallucinations in LVMs by systematically applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive-level reasoning. This step-by-step pipeline mirrors human-like video comprehension and effectively identifies hallucinations. Extensive experiments demonstrate that Dr.V-Agent is effective in diagnosing hallucination while enhancing interpretability and reliability, offering a practical blueprint for robust video understanding in real-world scenarios. All our data and code are available at https://github.com/Eurekaleo/Dr.V.
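The abstract describes a level-by-level diagnosis: claims from an LVM's output are first checked against spatial (perceptive) grounding, then temporal grounding, and finally cognitive-level reasoning. A minimal sketch of that control flow is below; all class names, function signatures, and the grounding-check interfaces are hypothetical illustrations, not the paper's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch of a hierarchical hallucination-diagnosis pipeline.
# The three checks stand in for the perceptive, temporal, and cognitive
# levels; real implementations would use spatial-temporal grounding models.

@dataclass
class Finding:
    level: str      # level at which the claim was decided
    claim: str      # statement extracted from the LVM's output
    grounded: bool  # whether the claim survived all applicable checks

def diagnose(claims, spatial_ground, temporal_ground, cognitive_check):
    """Check each claim level by level; a claim failing an earlier level
    is flagged there without invoking the later, costlier checks."""
    findings = []
    for claim in claims:
        if not spatial_ground(claim):        # perceptive: objects/attributes
            findings.append(Finding("perceptive", claim, False))
        elif not temporal_ground(claim):     # temporal: event order/dynamics
            findings.append(Finding("temporal", claim, False))
        elif not cognitive_check(claim):     # cognitive: high-level reasoning
            findings.append(Finding("cognitive", claim, False))
        else:
            findings.append(Finding("cognitive", claim, True))
    return findings
```

The early-exit ordering mirrors the paper's stated design: perceptive and temporal grounding come first, and cognitive reasoning runs only on claims that pass the lower levels.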
Problem

Research questions and friction points this paper is trying to address.

Diagnose video hallucinations in large video models
Fine-grained spatial-temporal grounding for accurate detection
Enhance interpretability and reliability of video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical framework for video hallucination diagnosis
Fine-grained spatial-temporal grounding technique
Multi-level perception-temporal-cognition reasoning pipeline