HiProbe-VAD: Video Anomaly Detection via Hidden States Probing in Tuning-Free Multimodal LLMs

πŸ“… 2025-07-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Video anomaly detection (VAD) faces critical bottlenecks, including high computational overhead and heavy reliance on large-scale annotated data. To address these, we propose HiProbe-VAD, a training-free, fine-tuning-free VAD framework that leverages frozen multimodal large language models (MLLMs). We make the first discovery that the intermediate hidden states of MLLMs exhibit high sensitivity and linear separability with respect to anomalous events. Building on this insight, we design a Dynamic Layer Saliency Probing (DLSP) mechanism to adaptively identify the most discriminative hidden layer. HiProbe-VAD further integrates a lightweight anomaly scorer and a temporal localization module, operating directly on frozen MLLM features. The method achieves strong generalization and inherent interpretability without any parameter updates. Extensive experiments demonstrate that HiProbe-VAD significantly outperforms existing training-free approaches and surpasses most supervised and weakly supervised methods on UCF-Crime and XD-Violence, establishing a new paradigm for efficient, low-barrier VAD.

πŸ“ Abstract
Video Anomaly Detection (VAD) aims to identify and locate deviations from normal patterns in video sequences. Traditional methods often struggle with substantial computational demands and a reliance on extensive labeled datasets, thereby restricting their practical applicability. To address these constraints, we propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning. In this paper, we discover that the intermediate hidden states of MLLMs contain information-rich representations, exhibiting higher sensitivity and linear separability for anomalies compared to the output layer. To capitalize on this, we propose a Dynamic Layer Saliency Probing (DLSP) mechanism that intelligently identifies and extracts the most informative hidden states from the optimal intermediate layer during the MLLM's reasoning. A lightweight anomaly scorer and temporal localization module then efficiently detect anomalies using these extracted hidden states and finally generate explanations. Experiments on the UCF-Crime and XD-Violence datasets demonstrate that HiProbe-VAD outperforms existing training-free and most traditional approaches. Furthermore, our framework exhibits remarkable cross-model generalization capabilities across different MLLMs without any tuning, unlocking the potential of pre-trained MLLMs for video anomaly detection and paving the way for more practical and scalable solutions.
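The paper does not spell out DLSP's internals in this summary, but its core claim, that some intermediate layer's hidden states are most linearly separable for anomalies, can be sketched with a stand-in criterion. The snippet below is a minimal illustration, assuming a Fisher-style separability score as the layer-selection metric (our own assumption, not necessarily the paper's exact probe); the feature shapes and the demo data are hypothetical.

```python
import numpy as np

def fisher_separability(normal, anomalous):
    # Project both classes onto the mean-difference direction, then
    # compare between-class distance to within-class spread.
    w = anomalous.mean(axis=0) - normal.mean(axis=0)
    pn, pa = normal @ w, anomalous @ w
    between = (pa.mean() - pn.mean()) ** 2
    within = pn.var() + pa.var() + 1e-8  # epsilon avoids division by zero
    return between / within

def select_layer(normal_by_layer, anomalous_by_layer):
    # Return the index of the hidden layer whose features best separate
    # normal from anomalous clips, plus all per-layer scores.
    scores = [fisher_separability(n, a)
              for n, a in zip(normal_by_layer, anomalous_by_layer)]
    return int(np.argmax(scores)), scores

# Toy demo: 3 "layers" of 8-d clip features; only layer 1 separates classes.
rng = np.random.default_rng(0)
normal_by_layer = [rng.normal(0, 1, (32, 8)) for _ in range(3)]
anomalous_by_layer = [rng.normal(0, 1, (32, 8)) for _ in range(3)]
anomalous_by_layer[1] = anomalous_by_layer[1] + 5.0  # clear anomaly shift
best, scores = select_layer(normal_by_layer, anomalous_by_layer)
print(best)  # index of the most discriminative layer
```

Once the best layer is chosen, a lightweight scorer (e.g., a linear probe on those hidden states) can flag anomalous segments without updating any MLLM parameters, which is the training-free property the paper emphasizes.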
Problem

Research questions and friction points this paper is trying to address.

Detects video anomalies without fine-tuning MLLMs
Reduces computational demands and labeled data reliance
Improves anomaly detection via hidden states analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pre-trained MLLMs without fine-tuning
Dynamic Layer Saliency Probing extracts hidden states
Lightweight scorer detects anomalies efficiently
πŸ”Ž Similar Papers
No similar papers found.
Zhaolin Cai
Xinjiang University, Urumqi, Xinjiang, China
Fan Li
Xi’an Jiaotong University, Xi’an, Shaanxi, China
Ziwei Zheng
Xi'an Jiaotong University
Dynamic Neural Network
Yanjun Qin
Tsinghua University
Traffic Forecasting, Transportation mode recognition