🤖 AI Summary
Anomaly detection in surveillance videos faces significant challenges due to high inter-class diversity, severe class imbalance, and susceptibility to scene-level interference, particularly in human-centric settings. To address these issues, we propose a multi-class anomaly detection framework integrating human-centric preprocessing with spatiotemporal modeling. First, open-vocabulary human detection and identity-consistent tracking are performed using YOLO-World and ByteTrack. Second, foreground enhancement and Gaussian blurring suppress background clutter. Third, spatial features are extracted via InceptionV3 and temporal dynamics are modeled using a BiLSTM; as a novel contribution, a vision-language model is incorporated into the spatiotemporal deep network to enhance semantic understanding. Evaluated on the five-class UCF-Crime subset, the method achieves a mean accuracy of 92.41% and F1-scores exceeding 0.85 for all classes, demonstrating strong generalization and robustness in real-world scenarios.
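The background-suppression step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and box format are hypothetical, and a separable box blur (pure NumPy) stands in for the Gaussian blur the paper applies outside the detected person boxes.

```python
import numpy as np

def _mean1d(a, k, axis):
    """Running mean of window k along one axis, with edge padding."""
    pad = k // 2
    ap = np.pad(a, [(pad, pad) if i == axis else (0, 0) for i in range(a.ndim)],
                mode="edge")
    c = np.cumsum(ap, axis=axis, dtype=np.float64)
    c = np.insert(c, 0, 0.0, axis=axis)
    hi = [slice(None)] * a.ndim; hi[axis] = slice(k, None)
    lo = [slice(None)] * a.ndim; lo[axis] = slice(0, -k)
    return (c[tuple(hi)] - c[tuple(lo)]) / k

def box_blur(img, k):
    """Separable mean filter over H and W (stand-in for Gaussian blur)."""
    return _mean1d(_mean1d(img.astype(np.float64), k, 0), k, 1)

def suppress_background(frame, boxes, ksize=15):
    """Blur the whole frame, then paste the original pixels back inside
    each detected person box (boxes as (x1, y1, x2, y2), hypothetical format)."""
    out = box_blur(frame, ksize)
    for x1, y1, x2, y2 in boxes:
        out[y1:y2, x1:x2] = frame[y1:y2, x1:x2]
    return np.clip(out, 0, 255).astype(frame.dtype)
```

In a full pipeline the `boxes` would come from the YOLO-World detections (tracked across frames by ByteTrack), so the sharp foreground follows each person while scene-specific background detail is smoothed away.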
📝 Abstract
Anomaly detection in surveillance videos remains a challenging task due to the diversity of abnormal events, class imbalance, and scene-dependent visual clutter. To address these issues, we propose a robust deep learning framework that integrates human-centric preprocessing with spatiotemporal modeling for multi-class anomaly classification. Our pipeline begins by applying YOLO-World, an open-vocabulary vision-language detector, to identify human instances in raw video clips, followed by ByteTrack for consistent identity-aware tracking. Background regions outside detected bounding boxes are suppressed via Gaussian blurring, effectively reducing scene-specific distractions and focusing the model on behaviorally relevant foreground content. The refined frames are then processed by an ImageNet-pretrained InceptionV3 network for spatial feature extraction, and temporal dynamics are captured using a bidirectional LSTM (BiLSTM) for sequence-level classification. Evaluated on a five-class subset of the UCF-Crime dataset (Normal, Burglary, Fighting, Arson, Explosion), our method achieves a mean test accuracy of 92.41% across three independent trials, with per-class F1-scores consistently exceeding 0.85. Comprehensive evaluation metrics, including confusion matrices, ROC curves, and macro/weighted averages, demonstrate strong generalization and resilience to class imbalance. The results confirm that foreground-focused preprocessing significantly enhances anomaly discrimination in real-world surveillance scenarios.
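The temporal half of the pipeline, per-frame spatial features fed through a bidirectional recurrence into a 5-way classifier, can be sketched at the shape level like this. Everything here is illustrative: the feature dimension assumes InceptionV3's 2048-d pooled output, the hidden size is arbitrary, a plain tanh recurrence stands in for the LSTM gating, and (unlike a real BiLSTM) the two directions share weights for brevity.

```python
import numpy as np

NUM_CLASSES = 5   # Normal, Burglary, Fighting, Arson, Explosion
FEAT_DIM = 2048   # assumed InceptionV3 pooled feature size
HIDDEN = 64       # illustrative hidden size

def rnn_pass(feats, W_x, W_h, reverse=False):
    """One direction over the frame features; tanh RNN stands in for LSTM gates."""
    h = np.zeros(HIDDEN)
    T = feats.shape[0]
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        h = np.tanh(feats[t] @ W_x + h @ W_h)
    return h

def classify_clip(feats, params):
    """feats: (T, FEAT_DIM) per-frame spatial features -> class probabilities."""
    W_x, W_h, W_out = params
    h_fwd = rnn_pass(feats, W_x, W_h)                # forward summary
    h_bwd = rnn_pass(feats, W_x, W_h, reverse=True)  # backward summary
    h = np.concatenate([h_fwd, h_bwd])               # bidirectional clip embedding
    logits = h @ W_out
    e = np.exp(logits - logits.max())                # stable softmax
    return e / e.sum()
```

The bidirectional summary lets the clip-level decision draw on both the build-up to an event and its aftermath, which is why the paper favors a BiLSTM over a unidirectional one.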