A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video anomaly analysis methods are largely limited to frame-level scoring, lacking spatiotemporal localization and semantic interpretability, while heavily relying on labeled data and exhibiting poor generalization. This paper introduces the first end-to-end zero-shot video anomaly analysis framework that jointly achieves temporal detection, spatial localization, and natural language explanation. Our approach leverages vision-language foundation models and employs test-time chained inference—integrating intra-task optimization and inter-task cascading—without any training or fine-tuning. The core innovation is a prompt-driven, multi-granularity collaborative reasoning mechanism that eliminates task-specific design and data dependency. Evaluated across multiple benchmarks for anomaly detection, localization, and explanation, our method achieves state-of-the-art zero-shot performance, demonstrating superior generalization and intrinsic interpretability.

📝 Abstract
Most video-anomaly research stops at frame-wise detection, outputting only per-frame anomaly scores without spatial or semantic context and offering little insight into why an event is abnormal. Recent video anomaly localization and video anomaly understanding methods improve explainability but remain data-dependent and task-specific. We propose a unified reasoning framework that bridges the gap between temporal detection, spatial localization, and textual explanation. Our approach is built upon a chained test-time reasoning process that sequentially connects these tasks, enabling holistic zero-shot anomaly analysis without any additional training. Specifically, it leverages intra-task reasoning to refine temporal detections and inter-task chaining for spatial and semantic understanding, yielding improved interpretability and generalization. Without any additional data or gradients, our method achieves state-of-the-art zero-shot performance across multiple video anomaly detection, localization, and explanation benchmarks. The results demonstrate that careful prompt design with task-wise chaining can unlock the reasoning power of foundation models, enabling practical, interpretable video anomaly analysis in a fully zero-shot manner. Project Page: https://rathgrith.github.io/Unified_Frame_VAA/.
Problem

Research questions and friction points this paper is trying to address.

Unifies temporal detection, spatial localization, and textual explanation for anomalies
Enables holistic zero-shot video anomaly analysis without training
Improves interpretability and generalization in fully zero-shot manner
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chained test-time reasoning connects detection and explanation
Intra-task reasoning refines temporal detections without training
Inter-task chaining enables spatial and semantic understanding
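The three bullets above describe a pipeline: intra-task reasoning refines per-frame temporal scores, then inter-task chaining passes flagged frames on to spatial localization and textual explanation. A minimal, hypothetical sketch of that control flow follows; the paper does not publish implementation code, so `query_vlm`, the prompts, and the smoothing-based refinement here are illustrative stand-ins, with the VLM stubbed out so the flow is runnable.

```python
def query_vlm(prompt: str, frame) -> str:
    """Stub for a vision-language foundation model call.

    A real system would send the frame plus the prompt to a VLM and
    parse its text response; this stub only mimics the interface.
    """
    if "score" in prompt:
        return "0.9"  # pretend the VLM rated this frame anomalous
    return "person near exit: loitering"  # pretend localization/explanation

def analyze_video(frames, threshold: float = 0.5):
    """Chained test-time reasoning: detect -> localize -> explain."""
    # Stage 1 (intra-task): score each frame, then refine temporally.
    scores = [float(query_vlm("Rate the anomaly score from 0 to 1.", f))
              for f in frames]
    # Illustrative refinement: average each score with its neighbors.
    refined = [sum(scores[max(0, i - 1):i + 2]) / len(scores[max(0, i - 1):i + 2])
               for i in range(len(scores))]

    results = []
    for i, score in enumerate(refined):
        if score < threshold:
            continue  # only anomalous frames continue down the chain
        # Stage 2 (inter-task): localize the anomalous region spatially.
        region = query_vlm("Localize the anomalous region.", frames[i])
        # Stage 3 (inter-task): explain, conditioned on the localization.
        explanation = query_vlm(f"Explain the anomaly at {region}.", frames[i])
        results.append({"frame": i, "score": score,
                        "region": region, "explanation": explanation})
    return results
```

Because every stage is a prompt to the same frozen model, no gradients or task-specific heads are needed; swapping the stub for a real VLM client is the only integration point.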
Dongheng Lin
Institute of Information Science, Beijing Jiaotong University
Mengxue Qu
Beijing Jiaotong University
Vision & Language, Detection, Segmentation
Kunyang Han
Institute of Information Science, Beijing Jiaotong University
Jianbo Jiao
University of Birmingham | University of Oxford
Computer Vision, Machine Learning
Xiaojie Jin
Institute of Information Science, Beijing Jiaotong University
Yunchao Wei
Professor, Beijing Jiaotong University, UTS, UIUC, NUS
Computer Vision, Machine Learning