Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method

📅 2026-01-15
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing video anomaly detection methods, which lack explicit, multi-stage reasoning mechanisms necessary for risk-aware and decision-oriented understanding. To bridge this gap, the authors introduce the Video Anomaly Reasoning (VAR) task and propose a novel annotation framework grounded in the Perception-Cognition-Action Chain-of-Thought (PerCoAct-CoT) paradigm. They release a large-scale benchmark dataset comprising 8,641 videos and over 50,000 annotated samples, alongside Vad-R1-Plus, an end-to-end multimodal large language model equipped with adaptive hierarchical reasoning and an Anomaly-Aware Group Relative Policy Optimization strategy. Experimental results demonstrate that the proposed approach significantly outperforms both open-source and closed-source baselines on the VAR task, effectively advancing multimodal large language models from descriptive comprehension toward decision-driven reasoning.

πŸ“ Abstract
Recent progress in reasoning capabilities of Multimodal Large Language Models (MLLMs) has highlighted their potential for performing complex video understanding tasks. However, in the domain of Video Anomaly Detection and Understanding (VAD&U), existing MLLM-based methods are largely limited to anomaly localization or post-hoc description, lacking explicit reasoning processes, risk awareness, and decision-oriented interpretation. To address this gap, we define a new task termed Video Anomaly Reasoning (VAR), which elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning. VAR explicitly requires models to perform progressive reasoning over anomalous events before answering anomaly-related questions, encompassing visual perception, causal interpretation, and risk-aware decision making. To support this task, we present a new dataset with 8,641 videos, where each video is annotated with diverse question types corresponding to different reasoning depths, totaling more than 50,000 samples, making it one of the largest datasets for video anomaly understanding. The annotations are based on a structured Perception-Cognition-Action Chain-of-Thought (PerCoAct-CoT), which formalizes domain-specific reasoning priors for video anomaly understanding. This design enables systematic evaluation of multi-stage and adaptive anomaly reasoning. In addition, we propose Anomaly-Aware Group Relative Policy Optimization to further enhance reasoning reliability under weak supervision. Building upon the proposed task and dataset, we develop an end-to-end MLLM-based VAR model termed Vad-R1-Plus, which supports adaptive hierarchical reasoning and risk-aware decision making. Extensive experiments demonstrate that the proposed benchmark and method effectively advance the reasoning capabilities of MLLMs on VAR tasks, outperforming both open-source and proprietary baselines.
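The abstract's Anomaly-Aware Group Relative Policy Optimization builds on GRPO, whose core idea is to normalize each sampled response's reward against its own sampling group rather than using a learned value critic. A minimal sketch of that group-relative normalization is below; the `anomaly_weight` scaling is a hypothetical illustration of how an anomaly-aware variant might reweight advantages, not the paper's actual reward design.

```python
# Minimal GRPO-style group-relative advantage sketch.
# `anomaly_weight` is a hypothetical knob, NOT the paper's method:
# an anomaly-aware variant could, e.g., upweight groups sampled
# from anomalous clips.
from statistics import mean, pstdev


def group_relative_advantages(rewards, anomaly_weight=1.0):
    """Normalize each reward against its sampling group's mean/std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All responses scored equally: no learning signal.
        return [0.0 for _ in rewards]
    return [anomaly_weight * (r - mu) / sigma for r in rewards]


# Four sampled responses to one prompt, with scalar rewards.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Advantages within a group sum to zero, so better-than-average responses are reinforced and worse-than-average ones suppressed, with no separate value network.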
Problem

Research questions and friction points this paper is trying to address.

Video Anomaly Reasoning
Multimodal Large Language Models
Reasoning Capability
Risk-aware Decision Making
Structured Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video Anomaly Reasoning
Multimodal Large Language Models
PerCoAct-CoT
Risk-aware Decision Making
Adaptive Hierarchical Reasoning