AI Summary
Existing training-free video anomaly detection methods rely on static prompts and neglect scene geometry, which leads to unstable performance and poor interpretability in complex or hierarchical scenes. This work proposes MM-VAD, a novel framework that, for the first time, models scene hierarchy in hyperbolic space and reformulates anomaly detection as a test-time adaptive inference process. It constructs hierarchical semantic representations via hyperbolic embeddings, leverages a frozen large language model's question-answering mechanism for context-aware judgment, and employs lightweight prompt optimization together with covariance-aware Mahalanobis distance alignment. Requiring no training, the method achieves state-of-the-art results across four benchmarks, attaining AUC scores of 90.03% (XD-Violence), 83.24% (UCF-Crime), 96.95% (ShanghaiTech), and 98.81% (UCSD Ped2).
Abstract
Training-free video anomaly detection (VAD) has recently emerged as a scalable alternative to supervised approaches, yet existing methods largely rely on static prompting and geometry-agnostic feature fusion. As a result, anomaly inference is often reduced to shallow similarity matching over Euclidean embeddings, leading to unstable predictions and limited interpretability, especially in complex or hierarchically structured scenes. We introduce MM-VAD, a geometry-aware semantic reasoning framework for training-free VAD that reframes anomaly detection as adaptive test-time inference rather than fixed feature comparison. Our approach projects caption-derived scene representations into hyperbolic space to better preserve hierarchical structure and performs anomaly assessment through an adaptive question-answering process over a frozen large language model. A lightweight, learnable prompt is optimised at test time using an unsupervised confidence-sparsity objective, enabling context-specific calibration without updating any backbone parameters. To further ground semantic predictions in visual evidence, we incorporate a covariance-aware Mahalanobis refinement that stabilises cross-modal alignment. Across four benchmarks, MM-VAD consistently improves over prior training-free methods, achieving 90.03% AUC on XD-Violence and 83.24%, 96.95%, and 98.81% on UCF-Crime, ShanghaiTech, and UCSD Ped2, respectively. Our results demonstrate that geometry-aware representation and adaptive semantic calibration provide a principled and effective alternative to static Euclidean matching in training-free VAD.
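The two geometric ingredients named in the abstract, projection into hyperbolic space and a covariance-aware Mahalanobis distance, can be illustrated with a minimal NumPy sketch. It is not the MM-VAD implementation: the abstract does not say which hyperbolic model is used, so the Poincaré ball with curvature parameter `c` is an assumption here, as are the ridge term on the covariance and the function names `expmap0`, `poincare_distance`, and `mahalanobis`, which are hypothetical.

```python
import numpy as np


def expmap0(v, c=1.0, eps=1e-9):
    """Map a Euclidean (tangent) vector to the Poincare ball of curvature -c.

    Assumption: the Poincare ball model; the source only says "hyperbolic space".
    """
    v = np.asarray(v, dtype=float)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    sqrt_c = np.sqrt(c)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)


def poincare_distance(x, y, c=1.0, eps=1e-9):
    """Geodesic distance between two points inside the Poincare ball."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sq_dist = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - c * np.sum(x * x, axis=-1)) * (1.0 - c * np.sum(y * y, axis=-1))
    # Clamp the denominator so points very close to the boundary stay finite.
    arg = 1.0 + 2.0 * c * sq_dist / np.maximum(denom, eps)
    return np.arccosh(arg) / np.sqrt(c)


def mahalanobis(x, samples, ridge=1e-3):
    """Mahalanobis distance from x to the distribution of reference samples.

    `samples` has shape (n, d) with d >= 2. The ridge term is an assumption
    added to keep the covariance matrix invertible.
    """
    samples = np.asarray(samples, dtype=float)
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False) + ridge * np.eye(samples.shape[1])
    diff = np.asarray(x, dtype=float) - mu
    # Solve cov @ z = diff instead of forming the explicit inverse.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))
```

In this sketch, the hyperbolic distance grows rapidly as points approach the boundary of the ball, which is the property usually cited for embedding tree-like hierarchies, while the Mahalanobis distance rescales each direction by the spread of the reference features, so deviations along low-variance directions count for more than they would under a plain Euclidean distance.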