🤖 AI Summary
Existing video anomaly detection methods typically rely on large-scale annotations or task-specific training, which hinders rapid generalization to novel scenes. To overcome this limitation, the paper proposes a training-free, zero-shot detection framework that exploits the geometric structure of intermediate-layer features from pretrained multimodal foundation models, projected onto a unit hypersphere. Anomaly discrimination is formulated as a likelihood ratio test under the von Mises–Fisher (vMF) distribution, enabling geodesic inference on the sphere. Directional prototype alignment is achieved through Fréchet mean centering, Holistic Scene Attention (HSA), and Spherical Geodesic Pulling (SGP). The method sets a new state of the art among training-free approaches on three mainstream benchmarks and remains competitive with fully supervised models.
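To make the likelihood ratio formulation concrete, here is the standard two-prototype case worked out. The symbols $\mu_a$ (anomalous direction), $\mu_n$ (normal direction), and the shared concentration $\kappa$ are illustrative assumptions and are not taken from the paper, whose exact test may differ.

$$
f(x;\mu,\kappa) = C_d(\kappa)\,\exp\!\left(\kappa\,\mu^{\top}x\right), \qquad \lVert x\rVert = \lVert \mu\rVert = 1
$$

$$
\Lambda(x) = \log\frac{f(x;\mu_a,\kappa)}{f(x;\mu_n,\kappa)} = \kappa\,(\mu_a-\mu_n)^{\top}x = \kappa\left(\cos\theta_a - \cos\theta_n\right)
$$

Here $\theta_a$ and $\theta_n$ are the geodesic distances (angles) from $x$ to the two prototypes. Because $\kappa$ is shared, the normalizer $C_d(\kappa)$ cancels, so under this assumption the test reduces to comparing angles on the sphere.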
📝 Abstract
Video anomaly detection (VAD) aims to automatically identify events that deviate from normal patterns in untrimmed surveillance videos. Existing methods universally depend on large-scale annotations or task-specific training procedures, severely limiting their rapid deployment to novel scenes. We observe that intermediate-layer features of pretrained multimodal large language models (MLLMs) already encode rich anomaly semantics, yet existing approaches rely on the language output pathway and fail to exploit the geometric discriminability latent in these representations. Based on this finding, we propose SphereVAD, a fully training-free, zero-shot VAD framework that recasts anomaly discrimination as von Mises–Fisher (vMF) likelihood-ratio geodesic inference on the unit hypersphere, unlocking latent discriminability through principled geometric reasoning rather than by learning new representations. Specifically, SphereVAD first applies Fréchet mean centering to unfold feature distributions and eliminate domain biases, then employs Holistic Scene Attention (HSA) to reinforce feature consistency using cross-video priors, and finally performs vMF-guided Spherical Geodesic Pulling (SGP) to align ambiguous segments with directional prototypes on the spherical manifold. This training-free pipeline requires only a small number of synthetic images for calibration. SphereVAD establishes new state-of-the-art results among training-free approaches on three major benchmarks and remains competitive with fully supervised baselines. Code will be available upon acceptance.
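The geometric steps named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the default `kappa`, the extrinsic-mean approximation used for centering, and the slerp-style pull are all assumptions made for illustration, and HSA is omitted because the abstract does not describe its mechanics.

```python
import numpy as np


def normalize(x, eps=1e-12):
    """Project vectors (along the last axis) onto the unit hypersphere."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def center_features(feats):
    """Assumed centering step: subtract the mean direction, then re-project.

    The normalized Euclidean mean (extrinsic mean) stands in for the
    Frechet mean here; the paper's exact procedure may differ.
    """
    unit = normalize(feats)
    mean_dir = normalize(unit.mean(axis=0))
    return normalize(unit - mean_dir)


def vmf_log_ratio(x, mu_anom, mu_norm, kappa=10.0):
    """vMF log-likelihood ratio with a shared concentration kappa (assumed).

    With equal kappa the normalizing constants cancel, leaving
    kappa * (mu_anom - mu_norm) . x
    """
    return kappa * (x @ (mu_anom - mu_norm))


def geodesic_pull(x, mu, t):
    """Move unit vector x a fraction t along the great circle toward mu."""
    cos_theta = np.clip(np.dot(x, mu), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    sin_theta = np.sin(theta)
    if sin_theta < 1e-8:
        # x and mu coincide or are antipodal: the direction is undefined,
        # so leave x unchanged.
        return x
    return (np.sin((1.0 - t) * theta) * x + np.sin(t * theta) * mu) / sin_theta
```

In this sketch, a segment feature would be centered, optionally pulled toward its nearest prototype with a small `t`, and then scored with `vmf_log_ratio`; a positive score favors the anomalous prototype. How the prototypes are obtained (the abstract mentions calibration on synthetic images) is outside the scope of the example.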