🤖 AI Summary
To address the limitations of fine-grained motion modeling and multi-scale contextual understanding in weakly supervised anomaly detection for surveillance videos, this paper proposes a collaborative learning framework that jointly models short-, medium-, and long-term temporal features. A multi-timescale tubelet sampling mechanism is designed and integrated with a Video Swin Transformer to capture spatiotemporal dynamics, and weakly supervised contrastive learning with cross-dataset transfer adaptation is introduced to improve generalization. The authors also construct VADD, a large-scale extension of the UCF-Crime dataset comprising 2,591 videos across 18 anomaly categories with broad coverage of realistic anomalies. Extensive experiments show state-of-the-art results on UCF-Crime (89.78% AUC) and complementary performance to existing methods on ShanghaiTech (95.32% AUC) and XD-Violence (84.57% AP).
📝 Abstract
Detection of anomalous events is relevant for public safety and requires a combination of fine-grained motion information and contextual events at variable time-scales. To this end, we propose a Multi-Timescale Feature Learning (MTFL) method to enhance the representation of anomaly features. Short, medium, and long temporal tubelets are employed to extract spatio-temporal video features using a Video Swin Transformer. Experimental results demonstrate that MTFL outperforms state-of-the-art methods on the UCF-Crime dataset, achieving an anomaly detection performance of 89.78% AUC. Moreover, it delivers performance complementary to the SotA, with 95.32% AUC on the ShanghaiTech and 84.57% AP on the XD-Violence dataset. Furthermore, we generate an extended version of UCF-Crime for development and evaluation on a wider range of anomalies, namely the Video Anomaly Detection Dataset (VADD), comprising 2,591 videos across 18 classes with extensive coverage of realistic anomalies.
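To make the multi-timescale idea concrete, below is a minimal sketch of how short-, medium-, and long-term tubelets could be sampled around a shared anchor frame and encoded with a Video Swin Transformer. The clip length, stride values, edge-clamped padding, concatenation-based fusion, and the use of torchvision's `swin3d_t` backbone are all illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of multi-timescale tubelet sampling with a Video Swin backbone.
# Assumptions (not from the paper): 16-frame clips, strides (1, 2, 4),
# edge-clamped indices, and feature fusion by simple concatenation.
import torch
from torchvision.models.video import swin3d_t

def sample_tubelet(video, anchor, length, stride):
    """Extract `length` frames centered on `anchor`, spaced by `stride`.

    video: (T, C, H, W) tensor of decoded frames.
    Returns a (C, length, H, W) clip in the layout 3D backbones expect.
    """
    T = video.shape[0]
    idx = anchor + stride * (torch.arange(length) - length // 2)
    idx = idx.clamp(0, T - 1)              # repeat edge frames near boundaries
    return video[idx].permute(1, 0, 2, 3)  # (T', C, H, W) -> (C, T', H, W)

backbone = swin3d_t(weights=None)          # Video Swin Transformer (tiny)
backbone.head = torch.nn.Identity()        # expose pooled features, drop classifier

video = torch.randn(128, 3, 224, 224)      # dummy 128-frame RGB video
anchor = 64

# Same clip length at three strides: larger strides widen the temporal
# context window, giving short-, medium-, and long-term tubelets.
clips = [sample_tubelet(video, anchor, length=16, stride=s) for s in (1, 2, 4)]
feats = [backbone(c.unsqueeze(0)) for c in clips]  # each (1, 768) for Swin3D-T
fused = torch.cat(feats, dim=1)                    # (1, 2304) multi-scale feature
```

A downstream weakly supervised scoring head (e.g., a multiple-instance ranking model) would then map `fused` to a per-segment anomaly score; the paper itself describes how MTFL actually combines the three timescales.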