An Analysis Focused on Women's Safety: Can VAD Models Be Enhanced by a Multi-modal Dataset?

📅 2026-05-25
🤖 AI Summary
This study addresses a gap in video anomaly detection (VAD) research: the lack of datasets and methods tailored to women-centric safety scenarios, particularly under challenging conditions such as low illumination, low resolution, and long-distance surveillance, where behaviors like stalking, chain snatching, and harassment are difficult to detect. To bridge this gap, the authors introduce ExtrAnom, the first multimodal benchmark dataset specifically designed for women's safety. It comprises 1,001 real-world surveillance videos (500 normal and 501 anomalous), including low-light, low-resolution, and long-shot footage, and covers five categories of women-centric anomalous events. Each video is paired with four textual descriptions, one written by a human annotator and three generated by large language models, which supports training and cross-modal evaluation of vision-language models. Experiments indicate that models trained on existing VAD benchmarks fail to detect these anomalies effectively, whereas models trained on ExtrAnom achieve significantly higher accuracy and generalization.
📝 Abstract
Women's safety and security are paramount for a modern society. Crimes against women occur in daylight as well as in low-light conditions. Often, such events are captured by real-world surveillance cameras that operate at low resolutions. Despite substantial progress in computer vision research, video anomaly detection (VAD) focused on women's safety has not yet been adequately addressed. Existing video anomaly datasets contain well-lit, high-resolution, close-shot videos, and fail to represent women-centric anomalies such as chain snatching, stalking, inappropriate touch, and other subtle forms of crime against women. To address these problems, we propose the ExtrAnom dataset, a new multi-modal benchmark containing 1001 videos with textual descriptions, 500 normal and 501 anomalous, with the anomalous videos classified into 5 types of women-centric crimes. The anomalous videos comprise low-light (8%), low-resolution (13%), long-shot (15%), and daylight (64%) footage. By event type, the dataset covers stalking (3.9%), chain snatching (17.6%), kidnapping (7.3%), assassinations (2.3%), and harassment (18.9%), alongside normal videos (50%). Each video is supplemented with 4 textual annotations, one human-generated and three LLM-generated, enabling cross-modal and VLM-based validation. The aim of creating a women-centric dataset is to accurately detect women-centric anomaly patterns that can be observed visually. The dataset also helps VLMs generate accurate video-level descriptions. ExtrAnom has been benchmarked against popular unimodal and multi-modal VAD datasets (e.g., XD-Violence, UCF-Crime, and UCA) and SOTA methods. Experiments reveal that the existing datasets are insufficient to train models for detecting women-centric anomalies.
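The composition figures quoted in the abstract can be sanity-checked with a little arithmetic. The Python sketch below is illustrative only: the percentages and the total of 1001 videos come from the abstract, while the helper names (`shares_sum_to_100`, `approximate_counts`) and the derived per-class counts are assumptions, not numbers reported by the authors. Because the published percentages are rounded, the derived counts are approximate.

```python
import math

TOTAL_VIDEOS = 1001  # total reported in the abstract

# Event-type shares as quoted in the abstract (percent of all videos).
EVENT_SHARES = {
    "stalking": 3.9,
    "chain snatching": 17.6,
    "kidnapping": 7.3,
    "assassination": 2.3,
    "harassment": 18.9,
    "normal": 50.0,
}

# Capture-condition shares as quoted in the abstract. The abstract does not
# make the base of these percentages fully explicit, so only the sum is checked.
CONDITION_SHARES = {
    "low-light": 8.0,
    "low-resolution": 13.0,
    "long-shot": 15.0,
    "daylight": 64.0,
}


def shares_sum_to_100(shares, tol=0.1):
    """Return True if the percentage shares add up to 100 within `tol`."""
    return math.isclose(sum(shares.values()), 100.0, abs_tol=tol)


def approximate_counts(shares, total):
    """Convert rounded percentage shares into approximate video counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


if __name__ == "__main__":
    print("event shares consistent:", shares_sum_to_100(EVENT_SHARES))
    print("condition shares consistent:", shares_sum_to_100(CONDITION_SHARES))
    for name, count in approximate_counts(EVENT_SHARES, TOTAL_VIDEOS).items():
        print(f"{name}: ~{count} videos")
```

The derived counts need not add up exactly to the reported 500 normal and 501 anomalous videos, since the percentages are rounded to one decimal place.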
Problem

Research questions and friction points this paper is trying to address.

women's safety
video anomaly detection
multi-modal dataset
surveillance video
women-centric crimes
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal dataset
women-centric anomaly detection
video anomaly detection (VAD)
vision-language models (VLMs)
low-light surveillance