Is Video Anomaly Detection Misframed? Evidence from LLM-Based and Multi-Scene Models

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a limitation of current video anomaly detection research: it prioritizes cross-scene generalization at the expense of modeling the contextual dependencies and spatial locality inherent in normal behavior within individual scenes. The work advocates a return to a single-scene paradigm that emphasizes fine-grained modeling of spatial awareness, local geometry, and activity patterns. Through targeted visual analyses and empirical evaluation, it examines the representational capacity of multimodal large language models and reports deficiencies in spatial localization and fundamental task comprehension. The findings argue for interpretable, spatially aware single-scene models for effective anomaly detection.

📝 Abstract

Recent video anomaly detection (VAD) research has expanded rapidly with an emphasis on general models of normality intended to work across many different scenes. While this focus has led to improvements in scalability and multi-scene generalization, it has also shifted the field away from modeling the scene-specific and context-dependent nature of normal behavior. Contemporary approaches frequently rely on video-level weak supervision and opaque pretrained representations from multi-modal large language models (MLLMs), which encourage models to respond to familiar semantic anomaly categories rather than to deviations from the normal patterns of a particular environment. This trend suppresses spatial localization, introduces semantic bias, and reduces anomaly detection to a form of action recognition. In this paper, we examine whether these prevailing formulations align with the core requirements of real-world VAD, which is typically performed within a single scene where normality is determined by local geometry, semantics, and activity patterns. Through targeted visual analyses and empirical evaluations, we demonstrate the practical consequences of these limitations and show that meaningful progress in VAD requires renewed focus on single-scene, spatially aware, and explainable formulations that capture the nuanced structure of normality within individual environments.

Problem

Research questions and friction points this paper is trying to address.

video anomaly detection

scene-specific normality

spatial localization

semantic bias

weak supervision

Innovation

Methods, ideas, or system contributions that make the work stand out.

video anomaly detection

single-scene modeling

spatial awareness