Action Hints: Semantic Typicality and Context Uniqueness for Generalizable Skeleton-based Video Anomaly Detection

📅 2025-09-13

🤖 AI Summary
To address the challenges of zero-shot video anomaly detection (ZS-VAD), namely the absence of target-domain training data and poor generalization to unseen normal and abnormal behaviors in new surveillance scenarios, this paper proposes a skeleton-based framework that jointly models semantic typicality and context uniqueness. It maps skeleton snippets into an action-semantic space via language-guided semantic embedding, distills an LLM's knowledge of typical normal and abnormal behaviors during training, and analyzes spatio-temporal differences between snippets at test time to derive scene-adaptive anomaly boundaries. Because the approach does not rely on a fixed, domain-specific normality boundary, it is designed to generalize better across scenarios. Evaluated on four benchmarks (ShanghaiTech, UBnormal, NWPU, and UCF-Crime) covering over 100 unseen surveillance scenes, the method reports state-of-the-art results among skeleton-based methods.

📝 Abstract
Zero-Shot Video Anomaly Detection (ZS-VAD) requires temporally localizing anomalies without target-domain training data, a crucial task given practical concerns such as data privacy and new surveillance deployments. The skeleton-based approach has inherent generalization advantages for ZS-VAD because it eliminates domain disparities in both background and human appearance. However, existing methods learn only low-level skeleton representations and rely on a domain-limited normality boundary, so they do not generalize well to new scenes with different normal and abnormal behavior patterns. In this paper, we propose a novel zero-shot video anomaly detection framework that unlocks the potential of skeleton data via action typicality and uniqueness learning. First, we introduce a language-guided semantic typicality modeling module that projects skeleton snippets into an action-semantic space and distills an LLM's knowledge of typical normal and abnormal behaviors during training. Second, we propose a test-time context uniqueness analysis module that finely analyzes the spatio-temporal differences between skeleton snippets and then derives scene-adaptive boundaries. Without using any training samples from the target domain, our method achieves state-of-the-art results against skeleton-based methods on four large-scale VAD datasets: ShanghaiTech, UBnormal, NWPU, and UCF-Crime, featuring over 100 unseen surveillance scenes.
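
To make the idea of semantic typicality more concrete, here is a minimal sketch of how a skeleton snippet could be scored against language-derived action descriptions. It is an illustration under assumptions, not the paper's implementation: the embeddings are assumed to be precomputed vectors in a shared action-semantic space, and the function names (`cosine`, `typicality_score`), the softmax scoring rule, and the `temperature` parameter are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def typicality_score(snippet_emb, normal_embs, abnormal_embs, temperature=0.1):
    """Return an anomaly score in [0, 1] for one skeleton snippet.

    Assumption: snippet_emb is the snippet projected into the same space
    as text embeddings of typical normal and abnormal action descriptions.
    The score is the softmax probability mass on the abnormal descriptions.
    """
    sims_normal = [cosine(snippet_emb, e) for e in normal_embs]
    sims_abnormal = [cosine(snippet_emb, e) for e in abnormal_embs]
    logits = [s / temperature for s in sims_normal + sims_abnormal]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    abnormal_mass = sum(exps[len(sims_normal):])
    return abnormal_mass / sum(exps)
```

A snippet whose embedding lies close to a "normal" description receives a score near 0, and one close to an "abnormal" description receives a score near 1. How the paper actually combines these similarities is not stated in the abstract.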
Problem

Research questions and friction points this paper is trying to address.

Zero-shot anomaly detection without target domain training data
Generalizing skeleton-based methods to unseen surveillance scenes
Overcoming domain disparities in normal and abnormal behavior patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-guided semantic typicality modeling
Test-time context uniqueness analysis
Scene-adaptive boundary derivation
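
The test-time ideas listed above (context uniqueness and scene-adaptive boundaries) can be sketched roughly as follows. This is a hypothetical illustration, not the authors' method: it assumes each snippet is already a feature vector, measures uniqueness as the mean distance to the `k` nearest other snippets in the same scene, and sets the boundary from the scene's own median and median absolute deviation (MAD). The names `uniqueness_scores` and `scene_adaptive_flags`, and the parameters `k`, `z`, and `eps`, are assumptions introduced for this example.

```python
import math
import statistics


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def uniqueness_scores(snippets, k=2):
    """Mean distance from each snippet to its k nearest other snippets.

    Snippets that differ from everything else in the scene get high scores.
    """
    scores = []
    for i, current in enumerate(snippets):
        dists = sorted(
            euclidean(current, other)
            for j, other in enumerate(snippets)
            if j != i
        )
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest) if nearest else 0.0)
    return scores


def scene_adaptive_flags(scores, z=3.0, eps=1e-6):
    """Flag scores that sit far above this scene's own typical level.

    The boundary is median + z * (MAD + eps), so it adapts to each scene
    rather than using one fixed threshold for all scenes.
    """
    median = statistics.median(scores)
    mad = statistics.median([abs(s - median) for s in scores])
    boundary = median + z * (mad + eps)
    return [s > boundary for s in scores]
```

For instance, with four tightly clustered snippets and one distant snippet, only the distant one is flagged. The boundary is computed from the test scene itself, which mirrors the stated goal of avoiding a fixed, domain-limited normality boundary.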