AI Summary
Existing weakly supervised video anomaly detection (VAD) methods treat all anomalies as a single semantic class, neglecting their inherent semantic and temporal diversity. To address this limitation, we propose RefineVAD, a novel framework that, for the first time under weak supervision, jointly models fine-grained semantic structure and temporal motion patterns. Specifically, we introduce a semantics-guided feature recalibration module that embeds soft anomaly class priors into the representation space. Furthermore, we design a displacement-aware attention mechanism coupled with a global Transformer to capture long-range temporal dynamics, and employ cross-attention to align segment-level features with learnable class prototypes. Evaluated on the WVAD benchmark, RefineVAD achieves state-of-the-art performance, demonstrating that semantic-context-guided feature refinement is critical for enhancing anomaly discriminability in weakly supervised settings.
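The cross-attention step described above, where segment-level features attend to learnable class prototypes, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class `PrototypeCrossAttention`, the feature dimension, and the number of anomaly categories are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class PrototypeCrossAttention(nn.Module):
    """Hypothetical sketch: segment features query learnable class prototypes."""

    def __init__(self, dim=512, num_categories=13, num_heads=4):
        super().__init__()
        # One learnable prototype vector per soft anomaly category (assumed count)
        self.prototypes = nn.Parameter(torch.randn(num_categories, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, segments):
        # segments: (batch, num_segments, dim) video segment features
        batch = segments.size(0)
        protos = self.prototypes.unsqueeze(0).expand(batch, -1, -1)
        # Segment features act as queries; prototypes as keys and values
        refined, weights = self.attn(query=segments, key=protos, value=protos)
        # Residual connection preserves the original segment information
        return segments + refined, weights
```

The attention weights give each segment a soft affinity over anomaly categories, which is one plausible way such class priors could be injected into the representation space.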
Abstract
Weakly-Supervised Video Anomaly Detection aims to identify anomalous events using only video-level labels, balancing annotation efficiency with practical applicability. However, existing methods often oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics intrinsic to real-world anomalies. Inspired by how humans perceive anomalies, jointly interpreting temporal motion patterns and the semantic structures underlying different anomaly types, we propose RefineVAD, a novel framework that mimics this dual-process reasoning. Our framework integrates two core modules. The first, Motion-aware Temporal Attention and Recalibration (MoTAR), estimates motion salience and dynamically adjusts temporal focus via shift-based attention and global Transformer-based modeling. The second, Category-Oriented Refinement (CORE), injects soft anomaly category priors into the representation space by aligning segment-level features with learnable category prototypes through cross-attention. By jointly leveraging temporal dynamics and semantic structure, RefineVAD explicitly models both "how" motion evolves and "what" semantic category it resembles. Extensive experiments on the WVAD benchmark validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.
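One simple way to realize the motion-salience estimation that MoTAR performs is to treat the feature displacement between consecutive segments as a motion proxy and use it to gate temporal focus. The sketch below is an assumption-laden illustration of that idea, not the paper's actual module; the class name `MotionSalienceGate` and the gating design are hypothetical.

```python
import torch
import torch.nn as nn

class MotionSalienceGate(nn.Module):
    """Hypothetical sketch: gate segment features by inter-segment displacement."""

    def __init__(self, dim=512):
        super().__init__()
        # Maps a displacement vector to a scalar salience score
        self.score = nn.Linear(dim, 1)

    def forward(self, x):
        # x: (batch, num_segments, dim) segment features
        # Feature difference between consecutive segments; first diff is zero
        disp = torch.diff(x, dim=1, prepend=x[:, :1])
        gate = torch.sigmoid(self.score(disp))  # (batch, num_segments, 1)
        # Emphasize segments with large motion-induced feature change
        return x * gate, gate.squeeze(-1)
```

In a full model, the gated features would then feed the shift-based attention and the global Transformer that capture long-range temporal dynamics.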