VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding

📅 2025-07-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Video anomaly detection (VAD) requires both semantic understanding of anomalies and precise temporal localization, yet no existing model or dataset supports both at once. To address this, we introduce VAGU, the first benchmark to support anomaly grounding and understanding jointly. Each instance is annotated with an anomaly category, a natural-language explanation, fine-grained temporal boundaries, and multiple-choice video question-answering items. We propose “Glance then Scrutinize” (GtS), a training-free, prompt-guided two-stage framework that first coarsely localizes high-probability anomalous regions and then interprets the anomaly in detail while refining its temporal boundaries. We also design JeAUG, a metric that jointly evaluates semantic interpretability and temporal precision, addressing the limitations of traditional metrics. Experiments verify the effectiveness of the benchmark, the framework, and the metric.

📝 Abstract

Video Anomaly Detection (VAD) aims to identify anomalous events in videos and accurately determine their time intervals. Current VAD methods mainly fall into two categories: traditional DNN-based approaches that focus on temporal localization, and LLM-based approaches that emphasize semantic understanding. Both anomaly understanding and grounding are essential for comprehensive video anomaly detection and can complement each other. However, no existing model or dataset supports both tasks simultaneously. To address this, we introduce VAGU (Video Anomaly Grounding and Understanding), the first benchmark to integrate both tasks. Each VAGU instance includes annotations for anomaly category, semantic explanation, precise temporal grounding and Video QA. We also provide multiple-choice Video QA for objective evaluation. Based on this dataset, we propose Glance then Scrutinize (GtS), a training-free framework guided by textual prompts. The framework first enables coarse localization of high-probability anomalous regions, followed by detailed anomaly interpretation and temporal boundary refinement. Additionally, we propose the JeAUG metric, which jointly evaluates semantic interpretability and temporal precision, overcoming the limitations of traditional metrics. Extensive experiments verify the effectiveness of our benchmark, framework, and evaluation metric.
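To make the annotation types listed in the abstract concrete, here is a minimal Python sketch of what one annotated instance could look like, together with a standard temporal intersection-over-union (IoU) helper of the kind commonly used to score temporal grounding. The class names, field names, and units are assumptions made for illustration; they are not VAGU's actual schema, and the IoU helper is not the JeAUG metric, whose definition is not given here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MultipleChoiceQA:
    """One multiple-choice question about a video (hypothetical layout)."""
    question: str
    options: list[str]
    answer_index: int  # position of the correct option in `options`


@dataclass
class AnomalyInstance:
    """Hypothetical record; field names are illustrative, not VAGU's schema."""
    video_id: str
    category: str        # anomaly category label
    explanation: str     # natural-language description of the anomaly
    start_sec: float     # annotated start of the anomalous interval
    end_sec: float       # annotated end of the anomalous interval
    qa: list[MultipleChoiceQA] = field(default_factory=list)


def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

For example, a predicted interval of (10, 20) seconds against an annotated interval of (15, 25) seconds overlaps for 5 seconds out of a 15-second union, so the helper returns one third.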

Problem

Research questions and friction points this paper is trying to address.

Integrate anomaly grounding and understanding in videos

Develop first benchmark for joint anomaly tasks

Propose framework for localization and interpretation

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based benchmark integrates grounding and understanding

Training-free framework with textual prompt guidance

Joint metric evaluates interpretability and temporal precision
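The coarse-to-fine idea behind the framework can be illustrated with a small Python sketch. This is not the authors' implementation: GtS is driven by textual prompts to a model, whereas the sketch below assumes that some per-segment anomaly score is already available as a plain list of numbers. The function names `glance` and `scrutinize`, the fixed window size, and the threshold are all hypothetical.

```python
from typing import Sequence


def glance(scores: Sequence[float], window: int) -> tuple[int, int]:
    """Coarse pass: return the [start, end) window with the highest score sum."""
    if not 0 < window <= len(scores):
        raise ValueError("window must be between 1 and len(scores)")
    best_start = 0
    current = best_sum = sum(scores[:window])
    for start in range(1, len(scores) - window + 1):
        # Slide the window one step: add the new segment, drop the old one.
        current += scores[start + window - 1] - scores[start - 1]
        if current > best_sum:
            best_start, best_sum = start, current
    return best_start, best_start + window


def scrutinize(
    scores: Sequence[float], region: tuple[int, int], threshold: float
) -> tuple[int, int]:
    """Fine pass: tighten the region to the first and last segment at or above threshold."""
    start, end = region
    hits = [i for i in range(start, end) if scores[i] >= threshold]
    if not hits:
        return region  # nothing exceeds the threshold; keep the coarse region
    return hits[0], hits[-1] + 1


# Assumed stand-in scores for eight consecutive video segments.
scores = [0.1, 0.2, 0.1, 0.7, 0.9, 0.8, 0.3, 0.1]
coarse = glance(scores, window=4)
refined = scrutinize(scores, coarse, threshold=0.5)
```

With these stand-in scores, the coarse pass selects segments 3 to 6 and the fine pass trims the region to segments 3 to 5. In a prompt-driven setting, each pass would instead query a model about the selected span; the control flow is the only part this sketch aims to show.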

🔎 Similar Papers

VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs