A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video anomaly analysis methods are largely limited to frame-level scoring, lacking spatiotemporal localization and semantic interpretability, while heavily relying on labeled data and exhibiting poor generalization. This paper introduces the first end-to-end zero-shot video anomaly analysis framework that jointly achieves temporal detection, spatial localization, and natural language explanation. Our approach leverages vision-language foundation models and employs test-time chained inference—integrating intra-task optimization and inter-task cascading—without any training or fine-tuning. The core innovation is a prompt-driven, multi-granularity collaborative reasoning mechanism that eliminates task-specific design and data dependency. Evaluated across multiple benchmarks for anomaly detection, localization, and explanation, our method achieves state-of-the-art zero-shot performance, demonstrating superior generalization and intrinsic interpretability.

📝 Abstract
Most video-anomaly research stops at frame-wise detection, outputting only per-frame anomaly scores without spatial or semantic context and offering little insight into why an event is abnormal. Recent video anomaly localization and video anomaly understanding methods improve explainability but remain data-dependent and task-specific. We propose a unified reasoning framework that bridges the gap between temporal detection, spatial localization, and textual explanation. Our approach is built upon a chained test-time reasoning process that sequentially connects these tasks, enabling holistic zero-shot anomaly analysis without any additional training. Specifically, it leverages intra-task reasoning to refine temporal detections and inter-task chaining for spatial and semantic understanding, yielding improved interpretability and generalization. Without any additional data or gradients, our method achieves state-of-the-art zero-shot performance across multiple video anomaly detection, localization, and explanation benchmarks. The results demonstrate that careful prompt design with task-wise chaining can unlock the reasoning power of foundation models, enabling practical, interpretable video anomaly analysis in a fully zero-shot manner. Project Page: https://rathgrith.github.io/Unified_Frame_VAA/.
Problem

Research questions and friction points this paper is trying to address.

Unifies temporal detection, spatial localization, and textual explanation for anomalies
Enables holistic zero-shot video anomaly analysis without training
Improves interpretability and generalization in fully zero-shot manner
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chained test-time reasoning connects detection and explanation
Intra-task reasoning refines temporal detections without training
Inter-task chaining enables spatial and semantic understanding
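The three bullets above describe a pipeline: intra-task reasoning refines per-frame temporal scores, then inter-task chaining passes flagged frames on to spatial localization and textual explanation. A minimal, hypothetical sketch of that control flow follows; the paper does not publish implementation code, so `query_vlm`, the prompts, and the smoothing-based refinement here are illustrative stand-ins, with the VLM stubbed out so the flow is runnable.

```python
def query_vlm(prompt: str, frame) -> str:
    """Stub for a vision-language foundation model call.

    A real system would send the frame plus the prompt to a VLM and
    parse its text response; this stub only mimics the interface.
    """
    if "score" in prompt:
        return "0.9"  # pretend the VLM rated this frame anomalous
    return "person near exit: loitering"  # pretend localization/explanation

def analyze_video(frames, threshold: float = 0.5):
    """Chained test-time reasoning: detect -> localize -> explain."""
    # Stage 1 (intra-task): score each frame, then refine temporally.
    scores = [float(query_vlm("Rate the anomaly score from 0 to 1.", f))
              for f in frames]
    # Illustrative refinement: average each score with its neighbors.
    refined = [sum(scores[max(0, i - 1):i + 2]) / len(scores[max(0, i - 1):i + 2])
               for i in range(len(scores))]

    results = []
    for i, score in enumerate(refined):
        if score < threshold:
            continue  # only anomalous frames continue down the chain
        # Stage 2 (inter-task): localize the anomalous region spatially.
        region = query_vlm("Localize the anomalous region.", frames[i])
        # Stage 3 (inter-task): explain, conditioned on the localization.
        explanation = query_vlm(f"Explain the anomaly at {region}.", frames[i])
        results.append({"frame": i, "score": score,
                        "region": region, "explanation": explanation})
    return results
```

Because every stage is a prompt to the same frozen model, no gradients or task-specific heads are needed; swapping the stub for a real VLM client is the only integration point.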
Dongheng Lin
Institute of Information Science, Beijing Jiaotong University
Mengxue Qu
Beijing Jiaotong University
Vision & Language, Detection, Segmentation
Kunyang Han
Institute of Information Science, Beijing Jiaotong University
Jianbo Jiao
University of Birmingham | University of Oxford
Computer Vision, Machine Learning
Xiaojie Jin
Institute of Information Science, Beijing Jiaotong University
Yunchao Wei
Professor, Beijing Jiaotong University, UTS, UIUC, NUS
Computer Vision, Machine Learning