Evaluation of Vision-LLMs in Surveillance Video

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of detecting anomalous events in surveillance video at a scale where manual inspection is infeasible, this paper evaluates a language-mediated zero-shot anomaly detection framework: video frame sequences are converted into textual descriptions, and candidate event labels are scored via textual entailment, requiring no task-specific annotations. Experiments with four small, open vision-LLMs on UCF-Crime and RWF-2000 demonstrate effectiveness for simple, spatially salient anomalies but reveal sensitivity to noisy spatial cues and identity obfuscation; few-shot exemplars improve accuracy for some models at the cost of more false positives, and privacy-preserving transforms, especially full-body GAN-based ones, degrade action discrimination. To strengthen spatial grounding without task-specific training, the authors outline structure-aware prompts, lightweight cross-clip spatial memory, scene-graph and 3D-pose priors during description, and privacy methods that preserve action-relevant geometry. This work advances embodied video understanding and zero-shot vision-language reasoning for real-world security applications.

📝 Abstract
The widespread use of cameras in our society has created an overwhelming amount of video data, far exceeding the capacity for human monitoring. This presents a critical challenge for public safety and security, as the timely detection of anomalous or criminal events is crucial for effective response and prevention. The ability for an embodied agent to recognize unexpected events is fundamentally tied to its capacity for spatial reasoning. This paper investigates the spatial reasoning of vision-language models (VLMs) by framing anomalous action recognition as a zero-shot, language-grounded task, addressing the embodied perception challenge of interpreting dynamic 3D scenes from sparse 2D video. Specifically, we investigate whether small, pre-trained vision-LLMs can act as spatially-grounded, zero-shot anomaly detectors by converting video into text descriptions and scoring labels via textual entailment. We evaluate four open models on UCF-Crime and RWF-2000 under prompting and privacy-preserving conditions. Few-shot exemplars can improve accuracy for some models, but may increase false positives, and privacy filters, especially full-body GAN transforms, introduce inconsistencies that degrade accuracy. These results chart where current vision-LLMs succeed (simple, spatially salient events) and where they falter (noisy spatial cues, identity obfuscation). Looking forward, we outline concrete paths to strengthen spatial grounding without task-specific training: structure-aware prompts, lightweight spatial memory across clips, scene-graph or 3D-pose priors during description, and privacy methods that preserve action-relevant geometry. This positions zero-shot, language-grounded pipelines as adaptable building blocks for embodied, real-world video understanding. Our implementation for evaluating VLMs is publicly available at: https://github.com/pascalbenschopTU/VLLM_AnomalyRecognition
Problem

Research questions and friction points this paper is trying to address.

Evaluating Vision-LLMs for anomaly detection in surveillance videos
Assessing spatial reasoning capabilities for zero-shot event recognition
Testing privacy-preserving methods' impact on anomaly detection accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot anomaly detection via textual entailment
Converting video into text descriptions for scoring
Structure-aware prompts and spatial memory enhancement
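The entailment-based scoring step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the word-overlap scorer is a stand-in for a real NLI entailment model (the actual pipeline would score each label as a hypothesis against the VLM-generated caption with a pretrained entailment head), and the function names, example caption, and threshold are hypothetical.

```python
# Zero-shot anomaly labeling sketch: a caption generated by a vision-LLM
# for a video clip is compared against each candidate anomaly label via
# textual entailment; if no label is entailed strongly enough, the clip
# is treated as "normal". The scorer below is a toy word-overlap stand-in
# for a real NLI model.

def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Stand-in for an entailment probability: fraction of hypothesis
    words that also appear in the premise (the clip caption)."""
    premise_words = set(premise.lower().split())
    hyp_words = hypothesis.lower().split()
    return sum(w in premise_words for w in hyp_words) / len(hyp_words)

def classify_clip(caption: str, labels: list[str],
                  score_fn=toy_entailment_score,
                  threshold: float = 0.5) -> str:
    """Score every candidate label against the caption and return the
    best one, falling back to 'normal' below the threshold."""
    scores = {label: score_fn(caption, label) for label in labels}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "normal"

labels = ["fighting", "shoplifting", "road accident"]
print(classify_clip("two people are fighting near a parked car", labels))
print(classify_clip("a person walks down the street", labels))
```

Swapping `toy_entailment_score` for a pretrained NLI model keeps the control flow identical: the caption is the premise, each label (optionally wrapped in a template like "this video shows ...") is the hypothesis.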
Pascal Benschop
Department of Computer Science, Delft University of Technology, Delft, Netherlands
Cristian Meo
LatentWorlds AI, Delft University of Technology, Delft, Netherlands
Justin Dauwels
Delft University of Technology
graphical models, machine learning, signal processing, computational neuroscience, intelligent transportation systems
Jelte P. Mense
National Policelab AI & Model-Driven Decisions Lab, Delft University of Technology, Delft, Netherlands