CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World

📅 2025-11-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing deep models lack the capacity for deep contextual modeling (e.g., safety gear compliance) and complex causal reasoning in real-world video anomaly understanding (VAU), and no unified, context-aware evaluation benchmark exists. Method: We introduce CueBench—the first context-aware VAU benchmark—featuring an event-centric hierarchical taxonomy (14 conditional + 18 absolute anomaly classes) supporting multi-task evaluation (recognition, localization, detection, prediction); it enables fine-grained semantic modeling and fair comparison between generative and discriminative vision-language models (VLMs). We further propose a verifiable, task-aligned, and hierarchically refined reward mechanism based on R1-style reinforcement fine-tuning to train Cue-R1. Contribution/Results: Experiments show Cue-R1 achieves >24% average improvement over state-of-the-art methods on CueBench, significantly advancing real-scenario VAU performance.
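Among the tasks the benchmark unifies, temporal localization is typically scored by overlap between predicted and ground-truth time spans. The paper's exact scoring protocol is not given here, but temporal IoU is the standard metric for such tasks; a minimal sketch:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals in seconds.

    A standard video-grounding metric; CueBench's actual evaluation
    protocol is not reproduced here, so treat this as illustrative.
    """
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))       # overlap length
    union = (pe - ps) + (ge - gs) - inter             # combined extent
    return inter / union if union > 0 else 0.0
```

For example, a prediction of (2.0, 6.0) against ground truth (4.0, 8.0) overlaps for 2 s over a 6 s union, giving an IoU of 1/3; benchmarks usually report recall at thresholds such as IoU ≥ 0.5.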

📝 Abstract
How far are deep models from real-world video anomaly understanding (VAU)? Current works typically emphasize detecting unexpected occurrences that deviate from normal patterns, or comprehending anomalous events with interpretable descriptions. However, they exhibit only a superficial comprehension of real-world anomalies, with limited coverage of the complex principles and subtle context that distinguish anomalies from normality, e.g., climbing cliffs with safety gear vs. without it. To this end, we introduce CueBench, the first Benchmark of its kind devoted to Context-aware video anomalies within a Unified Evaluation framework. We establish a comprehensive event-centric hierarchical taxonomy anchored on two core event types: 14 conditional and 18 absolute anomaly events, defined by refined semantics drawn from diverse contexts across 174 scenes and 198 attributes. Based on this, we unify and benchmark context-aware VAU through challenging tasks spanning recognition, temporal grounding, detection, and anticipation. CueBench also serves as a rigorous and fair probing suite for generative and discriminative, as well as generalized and specialized, vision-language models (VLMs). To address the challenges underlying CueBench, we further develop Cue-R1, built on R1-style reinforcement fine-tuning with verifiable, task-aligned, and hierarchy-refined rewards in a unified generative manner. Extensive results on CueBench reveal that existing VLMs are still far from satisfactory real-world anomaly understanding, while our Cue-R1 surpasses these state-of-the-art approaches by over 24% on average.
Problem

Research questions and friction points this paper is trying to address.

Develops a benchmark for context-aware video anomaly understanding
Unifies evaluation across recognition, detection, temporal grounding, and anticipation
Addresses limitations in current models' real-world anomaly comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

CueBench benchmark for context-aware video anomaly evaluation
Cue-R1 model using reinforcement fine-tuning with verifiable rewards
Unified generative approach for multiple video understanding tasks
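The verifiable, hierarchy-refined rewards used for Cue-R1's reinforcement fine-tuning are not specified in detail here. As an illustration only (all names hypothetical, not the paper's actual reward design), one plausible component is an answer reward that grants partial credit when the predicted class falls under the correct coarse event type in the taxonomy:

```python
def hierarchical_reward(pred_label, gt_label, parent_of):
    """Sketch of a verifiable, hierarchy-refined recognition reward.

    Hypothetical illustration: the actual Cue-R1 reward mechanism is
    not reproduced here. `parent_of` maps each fine-grained anomaly
    class to its coarse event type (e.g., conditional vs. absolute).
    """
    if pred_label == gt_label:
        return 1.0   # exact fine-grained class match
    if parent_of.get(pred_label) == parent_of.get(gt_label):
        return 0.5   # right coarse event type, wrong leaf: partial credit
    return 0.0       # wrong branch of the taxonomy
```

Because the reward is computed by exact string comparison against ground-truth labels, it is verifiable in the R1 sense: no learned reward model is needed, so the signal cannot be gamed by reward-model exploitation.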
Yating Yu
Northwestern Polytechnical University
Video Understanding
Congqi Cao
School of Computer Science, Northwestern Polytechnical University
Computer Vision, Action Recognition
Zhaoying Wang
Northwestern Polytechnical University, Xi’an, Shaanxi, 710129, China
Weihua Meng
Northwestern Polytechnical University, Xi’an, Shaanxi, 710129, China
Jie Li
Northwestern Polytechnical University, Xi’an, Shaanxi, 710129, China
Yuxin Li
Northwestern Polytechnical University, Xi’an, Shaanxi, 710129, China
Zihao Wei
Northwestern Polytechnical University, Xi’an, Shaanxi, 710129, China
Zhongpei Shen
Northwestern Polytechnical University, Xi’an, Shaanxi, 710129, China
Jiajun Zhang
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing, Large Language Models, Multimodal Information Processing