AI Summary
Existing weakly supervised video anomaly detection (VAD) methods treat all anomalies as a single semantic class, neglecting their inherent semantic and temporal diversity. To address this limitation, we propose RefineVAD, a novel framework that, for the first time under weak supervision, jointly models fine-grained semantic structure and temporal motion patterns. Specifically, we introduce a semantics-guided feature recalibration module that embeds soft anomaly class priors into the representation space. Furthermore, we design a displacement-aware attention mechanism coupled with a global Transformer to capture long-range temporal dynamics, and employ cross-attention to align segment-level features with learnable class prototypes. Evaluated on the WVAD benchmark, RefineVAD achieves state-of-the-art performance, demonstrating that semantic-context-guided feature refinement is critical for enhancing anomaly discriminability in weakly supervised settings.
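The cross-attention step described above, where segment-level features attend to learnable class prototypes, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class `PrototypeCrossAttention`, the feature dimension, and the number of anomaly categories are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class PrototypeCrossAttention(nn.Module):
    """Hypothetical sketch: segment features query learnable class prototypes."""

    def __init__(self, dim=512, num_categories=13, num_heads=4):
        super().__init__()
        # One learnable prototype vector per soft anomaly category (assumed count)
        self.prototypes = nn.Parameter(torch.randn(num_categories, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, segments):
        # segments: (batch, num_segments, dim) video segment features
        batch = segments.size(0)
        protos = self.prototypes.unsqueeze(0).expand(batch, -1, -1)
        # Segment features act as queries; prototypes as keys and values
        refined, weights = self.attn(query=segments, key=protos, value=protos)
        # Residual connection preserves the original segment information
        return segments + refined, weights
```

The attention weights give each segment a soft affinity over anomaly categories, which is one plausible way such class priors could be injected into the representation space.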
Abstract
Weakly-Supervised Video Anomaly Detection aims to identify anomalous events using only video-level labels, balancing annotation efficiency with practical applicability. However, existing methods often oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics intrinsic to real-world anomalies. Inspired by how humans perceive anomalies, jointly interpreting temporal motion patterns and the semantic structures underlying different anomaly types, we propose RefineVAD, a novel framework that mimics this dual-process reasoning. Our framework integrates two core modules. The first, Motion-aware Temporal Attention and Recalibration (MoTAR), estimates motion salience and dynamically adjusts temporal focus via shift-based attention and global Transformer-based modeling. The second, Category-Oriented Refinement (CORE), injects soft anomaly category priors into the representation space by aligning segment-level features with learnable category prototypes through cross-attention. By jointly leveraging temporal dynamics and semantic structure, RefineVAD explicitly models both "how" motion evolves and "what" semantic category it resembles. Extensive experiments on the WVAD benchmark validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.
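One simple way to realize the motion-salience estimation that MoTAR performs is to treat the feature displacement between consecutive segments as a motion proxy and use it to gate temporal focus. The sketch below is an assumption-laden illustration of that idea, not the paper's actual module; the class name `MotionSalienceGate` and the gating design are hypothetical.

```python
import torch
import torch.nn as nn

class MotionSalienceGate(nn.Module):
    """Hypothetical sketch: gate segment features by inter-segment displacement."""

    def __init__(self, dim=512):
        super().__init__()
        # Maps a displacement vector to a scalar salience score
        self.score = nn.Linear(dim, 1)

    def forward(self, x):
        # x: (batch, num_segments, dim) segment features
        # Feature difference between consecutive segments; first diff is zero
        disp = torch.diff(x, dim=1, prepend=x[:, :1])
        gate = torch.sigmoid(self.score(disp))  # (batch, num_segments, 1)
        # Emphasize segments with large motion-induced feature change
        return x * gate, gate.squeeze(-1)
```

In a full model, the gated features would then feed the shift-based attention and the global Transformer that capture long-range temporal dynamics.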