🤖 AI Summary
This work addresses a limitation of existing video anomaly detection methods built on frozen multimodal large language models (MLLMs): because they inherit pretraining biases, they struggle to capture subtle or ambiguous anomalies. To overcome this, we propose SteerVAD, a framework that steers internal MLLM representations toward task-specific video contexts without fine-tuning. SteerVAD introduces, for the first time, a gradient-free Representational Separability Analysis (RSA) to identify expert attention heads that are sensitive to anomalies. A Hierarchical Meta-Controller (HMC) then dynamically generates anisotropic correction signals that modulate the representation manifolds of these heads. Evaluated on mainstream benchmarks using only 1% of the training data, SteerVAD achieves state-of-the-art performance among tuning-free methods and improves the detection of complex anomalies.
📝 Abstract
Video anomaly detection (VAD) aims to identify abnormal events in videos. Traditional VAD methods generally suffer from the high cost of labeled data and full training, so some recent works have explored using frozen multi-modal large language models (MLLMs) to perform VAD in a tuning-free manner. However, their performance is limited because they directly inherit pre-training biases and cannot adapt their internal representations to specific video contexts, which makes subtle or ambiguous anomalies difficult to handle. To address these limitations, we propose a novel intervention framework, termed SteerVAD, which advances MLLM-based VAD by shifting from passively reading internal representations to actively steering and rectifying them. Our approach first applies a gradient-free representational separability analysis (RSA) to identify the top attention heads, those most discriminative for VAD, as latent anomaly experts (LAEs). A hierarchical meta-controller (HMC) then generates dynamic rectification signals by jointly conditioning on the global context and the LAE outputs. These signals apply targeted, anisotropic scaling directly to the LAE representation manifolds, amplifying anomaly-relevant dimensions while suppressing inherent biases. Extensive experiments on mainstream benchmarks demonstrate that our method achieves state-of-the-art performance among tuning-free approaches while requiring only 1% of the training data, establishing it as a powerful new direction for video anomaly detection. The code will be released upon publication.
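The two steps described above, gradient-free head selection followed by per-dimension rescaling, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the RSA criterion or the HMC architecture, so the sketch assumes a Fisher-style ratio as a stand-in separability score and takes the scaling gains as given rather than generating them with a learned controller. All names (`separability_score`, `rank_heads`, `anisotropic_scale`, the head IDs, and the toy features) are hypothetical.

```python
from statistics import mean, pvariance


def separability_score(normal, abnormal, eps=1e-8):
    """Assumed stand-in for RSA: a Fisher-style ratio summed over dimensions.

    For each feature dimension, compares the squared gap between class means
    with the within-class variance. No gradients are needed.
    """
    score = 0.0
    for d in range(len(normal[0])):
        n_vals = [v[d] for v in normal]
        a_vals = [v[d] for v in abnormal]
        gap = mean(a_vals) - mean(n_vals)
        score += gap * gap / (pvariance(n_vals) + pvariance(a_vals) + eps)
    return score


def rank_heads(head_feats, labels, top_k):
    """Return the top_k head IDs ranked by separability (highest first).

    head_feats maps a head ID to one feature vector per clip;
    labels holds 0 (normal) or 1 (abnormal) for each clip.
    """
    scores = {}
    for head, feats in head_feats.items():
        normal = [f for f, y in zip(feats, labels) if y == 0]
        abnormal = [f for f, y in zip(feats, labels) if y == 1]
        scores[head] = separability_score(normal, abnormal)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def anisotropic_scale(vec, gains):
    """Scale each dimension by its own gain (a diagonal, non-uniform scaling).

    In the paper the gains come from the HMC; here they are simply supplied.
    """
    return [g * x for g, x in zip(gains, vec)]


# Toy data (hypothetical): head "L3H5" separates the classes, "L1H0" does not.
labels = [0, 0, 0, 1, 1, 1]
head_feats = {
    "L3H5": [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
             [1.0, 0.1], [1.1, 0.0], [0.9, 0.1]],
    "L1H0": [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5],
             [0.5, 0.4], [0.0, 1.0], [1.0, 0.1]],
}

experts = rank_heads(head_feats, labels, top_k=1)
steered = anisotropic_scale([1.0, 2.0], gains=[2.0, 0.5])
```

In this toy setup `experts` selects the head whose features differ between normal and abnormal clips, and `steered` amplifies the first dimension while damping the second. A uniform (isotropic) scaling would instead multiply every dimension by the same factor and could not emphasize anomaly-relevant dimensions over biased ones.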