🤖 AI Summary
This work addresses a limitation of existing skeleton-based video anomaly detection methods: they cannot simultaneously model the discrete semantic primitives and the fine-grained motion details of human activities, which constrains their ability to discriminate anomalies at multiple levels. To overcome this, the paper introduces a hierarchical motion-semantics modeling framework that decomposes skeleton sequences into interpretable semantic primitives and pose details. These components are jointly modeled with a vector-quantized variational autoencoder, an autoregressive Transformer, and conditional normalizing flows, enabling high-precision anomaly detection while preserving privacy. The proposed method achieves state-of-the-art performance, with AUC scores of 88.1% on HR-ShanghaiTech and 75.8% on HR-UBnormal, significantly improving multi-granularity anomaly recognition.
📝 Abstract
As embodied perception systems increasingly bridge digital and physical realms in interactive multimedia applications, privacy-preserving approaches to understanding human activities in physical environments have become paramount. Video anomaly detection (VAD) is a critical task in such embodied multimedia systems for intelligent surveillance and forensic analysis. Skeleton-based approaches have emerged as a privacy-preserving alternative that processes physical-world information through abstract human pose representations while discarding sensitive visual attributes such as identity and facial features. However, existing skeleton-based methods predominantly model continuous motion trajectories in a monolithic manner, failing to capture the hierarchical nature of human activities, which are composed of discrete semantic primitives and fine-grained kinematic details; this leads to reduced discriminability when anomalies manifest at different abstraction levels. In this regard, we propose Motion Semantics Guided Normalizing Flow (MSG-Flow), which decomposes skeleton-based VAD into hierarchical motion semantics modeling. It employs a vector-quantized variational autoencoder to discretize continuous motion into interpretable primitives, an autoregressive Transformer to model semantic-level temporal dependencies, and a conditional normalizing flow to capture detail-level pose variations. Extensive experiments on the HR-ShanghaiTech and HR-UBnormal benchmarks demonstrate that MSG-Flow achieves state-of-the-art performance with 88.1% and 75.8% AUC, respectively.
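To make the two-level decomposition concrete, here is a minimal toy sketch of the idea, not the authors' implementation: all names, shapes, and model choices are assumptions. Pose features are discretized by nearest-codebook lookup (standing in for the VQ-VAE), a smoothed bigram table over tokens stands in for the autoregressive Transformer at the semantic level, and the pose-to-primitive residual stands in for the conditional normalizing flow at the detail level; the anomaly score combines both terms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "codebook" of K motion primitives in a D-dim pose-feature space
# (in MSG-Flow this codebook would be learned by a VQ-VAE).
K, D, T = 8, 4, 16
codebook = rng.normal(size=(K, D))

def quantize(frames):
    """Map each pose-feature vector to its nearest codebook index."""
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

# Fit a bigram transition table on "normal" token sequences: a crude
# stand-in for the autoregressive Transformer over semantic primitives.
normal_frames = rng.normal(size=(200, D)) * 0.5 + codebook[rng.integers(0, 3, 200)]
tokens = quantize(normal_frames)
trans = np.ones((K, K))  # Laplace smoothing so unseen transitions stay finite
for a, b in zip(tokens[:-1], tokens[1:]):
    trans[a, b] += 1
trans /= trans.sum(axis=1, keepdims=True)

def anomaly_score(frames):
    """Higher = more anomalous; combines semantic- and detail-level terms."""
    toks = quantize(frames)
    # Semantic level: negative log-likelihood of the token sequence.
    sem = -np.log(trans[toks[:-1], toks[1:]]).mean()
    # Detail level: residual between each pose and its primitive (a real
    # system would score this under a conditional normalizing flow).
    det = ((frames - codebook[toks]) ** 2).sum(-1).mean()
    return sem + det

# A clip drawn near the "normal" primitives vs. one around unseen primitives.
normal_clip = rng.normal(size=(T, D)) * 0.5 + codebook[rng.integers(0, 3, T)]
odd_clip = rng.normal(size=(T, D)) * 0.5 + codebook[rng.integers(5, 8, T)] + 2.0
print(anomaly_score(normal_clip), anomaly_score(odd_clip))
```

The point of the sketch is the factorization: an anomaly can raise the score either by producing an improbable primitive sequence (semantic level) or by deviating from its primitive's typical pose (detail level), which is the multi-level discrimination the abstract describes.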