EventVAD: Training-Free Event-Aware Video Anomaly Detection

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video anomaly detection (VAD) faces two key challenges: supervised methods rely heavily on large-scale annotated data and generalize poorly, while training-free approaches struggle to precisely localize fine-grained visual events. This paper proposes EventVAD, the first event-aware, training-free VAD framework. It models dynamic spatiotemporal relationships via a temporal decay-constrained graph and achieves fine-grained anomaly localization through unsupervised event boundary detection driven by signal-ratio thresholding. It further introduces a hierarchical prompting strategy, reinforced by event consistency, to leverage multimodal large language models (7B) for semantic reasoning. Critically, EventVAD requires no training, which markedly improves generalization to unseen anomalies. On UCF-Crime and XD-Violence, it achieves state-of-the-art performance in the training-free setting, outperforming not only same-scale but also larger-parameter baselines across all metrics.
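To make the temporal decay-constrained graph idea concrete, here is a minimal sketch of one plausible formulation: edge weights combine per-frame feature similarity with an exponential decay in temporal distance, so distant frames are connected more weakly. The function name and the decay scale `tau` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def decay_graph(feats, tau=5.0):
    """Frame-similarity graph whose edges decay with temporal distance.

    feats: (T, D) array of per-frame features.
    tau:   hypothetical decay scale (not from the paper).
    """
    # Cosine similarity between every pair of frame features.
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T
    # Temporal decay: edge weight shrinks as frames grow apart in time.
    t = np.arange(feats.shape[0])
    decay = np.exp(-np.abs(t[:, None] - t[None, :]) / tau)
    return sim * decay

A = decay_graph(np.random.rand(8, 16))
```

The resulting adjacency matrix `A` is symmetric with unit diagonal; frames that are both visually similar and temporally close form strongly connected components, which is the kind of event-aware structure the summary describes.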

📝 Abstract
Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require large amounts of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and uses signal ratio thresholding to detect event boundaries via unsupervised statistical features. The statistical boundary detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs in performing reasoning before determining final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) performance in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Detecting anomalies in videos without training data
Localizing fine-grained visual transitions and diverse events
Improving temporal reasoning with event-aware features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic graph modeling with time-decay constraints
Unsupervised statistical event boundary detection
Hierarchical prompting strategy for MLLM reasoning
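The second bullet, unsupervised statistical event boundary detection, can be sketched as follows: smooth a per-frame change signal as a simple stand-in for the adaptive noise filtering, then flag frames where the smoothed signal exceeds a ratio threshold over a global baseline. The window size `win`, the `ratio` threshold, and the median baseline are illustrative assumptions, not the paper's actual statistics.

```python
import numpy as np

def event_boundaries(score, win=5, ratio=2.0):
    """Flag candidate event boundaries in a per-frame change signal.

    score: 1-D array of frame-to-frame change magnitudes.
    win, ratio: hypothetical parameters (not from the paper).
    """
    # Moving-average smoothing as a simple form of noise filtering.
    kernel = np.ones(win) / win
    smooth = np.convolve(score, kernel, mode="same")
    # Signal-ratio test against a global median baseline.
    baseline = np.median(smooth) + 1e-8
    return np.flatnonzero(smooth / baseline > ratio)

signal = np.concatenate([np.zeros(20), [10.0], np.zeros(20)])
bounds = event_boundaries(signal)
```

On this toy signal the single spike at frame 20 is flagged (along with its smoothing neighborhood), illustrating how such a detector segments a long video into events before any MLLM reasoning is applied.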