🤖 AI Summary
To address the challenge of detecting anomalous events in the overwhelming volume of surveillance video, where manual inspection is infeasible, this paper frames anomalous action recognition as a zero-shot, language-grounded task. Methodologically, it converts video into textual descriptions using small pre-trained vision-LLMs and scores candidate anomaly labels via textual entailment, requiring no task-specific training. Experiments with four open models on UCF-Crime and RWF-2000 show that few-shot exemplars can improve accuracy for some models but may increase false positives, and that privacy-preserving filters, especially full-body GAN transforms, introduce inconsistencies that degrade accuracy. Current vision-LLMs succeed on simple, spatially salient events but falter under noisy spatial cues and identity obfuscation. As future directions, the paper outlines structure-aware prompts, lightweight spatial memory across clips, scene-graph or 3D-pose priors during description, and privacy methods that preserve action-relevant geometry, positioning zero-shot, language-grounded pipelines as adaptable building blocks for embodied, real-world video understanding.
📝 Abstract
The widespread use of cameras in our society has created an overwhelming amount of video data, far exceeding the capacity for human monitoring. This presents a critical challenge for public safety and security, as the timely detection of anomalous or criminal events is crucial for effective response and prevention. The ability of an embodied agent to recognize unexpected events is fundamentally tied to its capacity for spatial reasoning. This paper investigates the spatial reasoning of vision-language models (VLMs) by framing anomalous action recognition as a zero-shot, language-grounded task, addressing the embodied perception challenge of interpreting dynamic 3D scenes from sparse 2D video. Specifically, we investigate whether small, pre-trained vision-LLMs can act as spatially-grounded, zero-shot anomaly detectors by converting video into text descriptions and scoring labels via textual entailment. We evaluate four open models on UCF-Crime and RWF-2000 under prompting and privacy-preserving conditions. Few-shot exemplars can improve accuracy for some models but may increase false positives, and privacy filters, especially full-body GAN transforms, introduce inconsistencies that degrade accuracy. These results chart where current vision-LLMs succeed (simple, spatially salient events) and where they falter (noisy spatial cues, identity obfuscation). Looking forward, we outline concrete paths to strengthen spatial grounding without task-specific training: structure-aware prompts, lightweight spatial memory across clips, scene-graph or 3D-pose priors during description, and privacy methods that preserve action-relevant geometry. This positions zero-shot, language-grounded pipelines as adaptable building blocks for embodied, real-world video understanding. Our implementation for evaluating VLMs is publicly available at: https://github.com/pascalbenschopTU/VLLM_AnomalyRecognition
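The core idea of the pipeline, reducing a clip to a textual description and then scoring each candidate label by how well the description entails it, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names (`entailment_score`, `classify_clip`) and the label hypotheses are hypothetical, and a toy word-overlap score stands in for a real vision-LLM description stage and natural-language-inference entailment model so the example stays self-contained.

```python
# Sketch of zero-shot, language-grounded anomaly labeling:
# clip -> textual description -> entailment score per candidate label.
# A bag-of-words overlap is a toy stand-in for a real NLI entailment model.

def entailment_score(description: str, hypothesis: str) -> float:
    """Toy entailment proxy: fraction of hypothesis words present in the description."""
    desc_words = set(description.lower().split())
    hyp_words = set(hypothesis.lower().split())
    return len(desc_words & hyp_words) / max(len(hyp_words), 1)

def classify_clip(description: str, labels: dict[str, str]) -> str:
    """Return the label whose textual hypothesis is best entailed by the description."""
    return max(labels, key=lambda name: entailment_score(description, labels[name]))

# Hypothetical label set with natural-language hypotheses.
labels = {
    "fighting": "people are fighting or hitting each other",
    "normal": "people are walking or standing calmly",
}

description = "two people are hitting each other near a parked car"
print(classify_clip(description, labels))  # -> fighting
```

In the paper's actual setup, the description comes from a small pre-trained vision-LLM over sampled frames, and the entailment step is what lets new anomaly categories be added by writing a new hypothesis sentence rather than retraining.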