Anomize: Better Open Vocabulary Video Anomaly Detection

📅 2025-03-23
🤖 AI Summary
Existing open-vocabulary video anomaly detection (OVVAD) methods suffer from two key limitations on novel anomalies: detection ambiguity (inaccurate anomaly scoring) and categorization confusion (misclassifying novel anomalies as visually similar base classes). To address these, we propose Anomize, a multi-source vision-text collaborative modeling framework. Our approach introduces a multi-level vision-text joint calibration mechanism to mitigate detection ambiguity, and a label-relation-graph-guided text encoder that explicitly models hierarchical semantic structures among categories to improve cross-class alignment and discriminability. It combines multi-scale video features, CLIP-style vision-language alignment, graph neural network-driven label relation modeling, and contrastive learning-enhanced anomaly scoring. Evaluated on UCF-Crime and XD-Violence, our method achieves state-of-the-art performance: +12.7% mAP for novel anomalies and a 34% reduction in misclassification rate.

📝 Abstract
Open Vocabulary Video Anomaly Detection (OVVAD) seeks to detect and classify both base and novel anomalies. However, existing methods face two specific challenges related to novel anomalies. The first is detection ambiguity, where the model struggles to assign accurate anomaly scores to unfamiliar anomalies. The second is categorization confusion, where novel anomalies are often misclassified as visually similar base instances. To mitigate detection ambiguity, we draw on supplementary information from multiple sources, leveraging multiple levels of visual data alongside matching textual information. To reduce categorization confusion, we propose incorporating label relations to guide the encoding of new labels, which improves the alignment between novel videos and their corresponding labels. The resulting Anomize framework effectively tackles both issues and achieves superior performance on the UCF-Crime and XD-Violence datasets, demonstrating its effectiveness for OVVAD.
Problem

Research questions and friction points this paper is trying to address.

Detect and classify base and novel anomalies in videos
Reduce detection ambiguity for unfamiliar anomalies
Mitigate categorization confusion between novel and base anomalies
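
To make the two sub-tasks concrete, the following is a minimal Python sketch of a generic open-vocabulary pipeline: per-frame anomaly scores are thresholded for detection, and a video embedding is matched against label-text embeddings by cosine similarity for classification. The function names, the 0.5 threshold, and the toy 3-dimensional embeddings are illustrative assumptions, not the Anomize implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def detect(frame_scores, threshold=0.5):
    """Return indices of frames whose anomaly score exceeds the threshold."""
    return [i for i, score in enumerate(frame_scores) if score > threshold]

def classify(video_emb, label_embs):
    """Return the label whose text embedding is closest to the video embedding."""
    return max(label_embs, key=lambda name: cosine(video_emb, label_embs[name]))

# Hypothetical toy data: two base labels and one novel label.
frame_scores = [0.1, 0.2, 0.8, 0.9, 0.3]
label_embs = {
    "fighting": [1.0, 0.0, 0.0],   # base
    "explosion": [0.0, 1.0, 0.0],  # base
    "riot": [0.8, 0.0, 0.6],       # novel
}
video_emb = [0.9, 0.1, 0.5]

anomalous_frames = detect(frame_scores)
predicted = classify(video_emb, label_embs)
print(anomalous_frames, predicted)
```

In this framing, detection ambiguity corresponds to unreliable values in `frame_scores` for unfamiliar events, and categorization confusion corresponds to a novel video landing closer to a base label such as "fighting" than to its true novel label.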
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages multi-level visual and textual data
Incorporates label relations for encoding
Improves alignment between videos and labels
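
The abstract does not specify how label relations enter the text encoder. One simple way to picture the idea is to blend a novel label's embedding with a relation-weighted average of base label embeddings. The sketch below is a toy under that assumption: `relation_guided_embedding`, the `alpha` mixing weight, and the relation scores are hypothetical and are not taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def relation_guided_embedding(novel_emb, base_embs, relations, alpha=0.5):
    """Blend a novel label embedding with related base label embeddings.

    relations maps base label name -> non-negative relation weight.
    alpha is the share of the result taken from the novel embedding itself.
    """
    total = sum(relations.values())
    if total == 0:
        return normalize(novel_emb)
    dim = len(novel_emb)
    # Relation-weighted mean of the base label embeddings.
    context = [
        sum(relations[name] * base_embs[name][i] for name in relations) / total
        for i in range(dim)
    ]
    mixed = [alpha * n + (1 - alpha) * c for n, c in zip(novel_emb, context)]
    return normalize(mixed)

# Hypothetical toy data: "riot" is assumed to be related to "fighting" only.
base_embs = {"fighting": [1.0, 0.0], "explosion": [0.0, 1.0]}
riot_emb = [0.6, 0.8]
guided = relation_guided_embedding(
    riot_emb, base_embs, {"fighting": 1.0, "explosion": 0.0}
)
print(guided)
```

Here the guided embedding moves toward the related base label while staying unit length. Whether relations should pull a novel label toward related classes or separate it from confusable ones is a design choice this sketch does not settle.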
Authors

Fei Li
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
Wenxuan Liu
State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Jingjing Chen
Fudan University
Multimedia, Computer Vision, Machine Learning, Pattern Recognition
Ruixu Zhang
Tsinghua University
Yuran Wang
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
Xian Zhong
Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology
Zheng Wang
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University