🤖 AI Summary
Weakly supervised video anomaly detection (VAD) traditionally relies solely on RGB inputs, which limits its ability to discriminate fine-grained, visually similar anomalies (e.g., shoplifting). To address this, PI-VAD is a poly-modal induction framework for weakly supervised VAD that takes only RGB videos as input while drawing on five additional modalities—pose, depth, panoptic segmentation, optical flow, and vision-language model (VLM) embeddings. PI-VAD pairs two plug-in modules, Pseudo-modality Generation and Cross-Modal Induction, so that the five modality backbones are needed only during training and add zero inference overhead. A polygonal geometric metaphor unifies the cross-modal alignment and fusion, with each modality forming one axis of the polygon around RGB. Evaluated on three real-world benchmarks—XD-Violence, UCF-Crime, and ShanghaiTech—PI-VAD achieves state-of-the-art performance, notably improving localization of fine-grained anomalies while keeping inference cost comparable to a single-RGB model.
📝 Abstract
Weakly-supervised methods for video anomaly detection (VAD) are conventionally based merely on RGB spatio-temporal features, which continues to limit their reliability in real-world scenarios: RGB features are not sufficiently distinctive to set apart categories such as shoplifting from visually similar events. Toward robust VAD in complex real-world settings, it is therefore essential to augment RGB spatio-temporal features with additional modalities. Motivated by this, we introduce the Poly-modal Induced framework for VAD, "PI-VAD", a novel approach that augments RGB representations with five additional modalities. Specifically, the modalities contribute sensitivity to fine-grained motion (pose), three-dimensional scene and entity representation (depth), surrounding objects (panoptic masks), global motion (optical flow), and language cues (VLM). Each modality represents an axis of a polygon, streamlined to add salient cues to RGB. PI-VAD includes two plug-in modules, a Pseudo-modality Generation module and a Cross-Modal Induction module, which generate modality-specific prototypical representations and thereby induce multi-modal information into the RGB cues. These modules operate by performing anomaly-aware auxiliary tasks and require the five modality backbones only during training. Notably, PI-VAD achieves state-of-the-art accuracy on three prominent VAD datasets encompassing real-world scenarios, without incurring the computational overhead of five modality backbones at inference.
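The training-only induction idea can be illustrated with a minimal sketch. Everything below is a hypothetical toy, not the paper's implementation: the five pseudo-modality "backbones" are random projections standing in for real pretrained models, the induction heads are simple linear maps trained by an MSE alignment loss (a stand-in for the anomaly-aware auxiliary tasks), and the anomaly score is a placeholder. The point is the structure: at training time the heads learn to reproduce each modality's prototype from RGB alone, and at inference only RGB and the learned heads are used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): feature dim, prototype dim, clip length.
D_RGB, D_PROTO, T = 64, 32, 16

# Training-only pseudo-modality "backbones": stand-ins that map an RGB clip
# feature to a modality-specific prototype (pose, depth, panoptic, flow, VLM).
MODALITIES = ["pose", "depth", "panoptic", "flow", "vlm"]
pseudo_backbones = {m: rng.standard_normal((D_RGB, D_PROTO)) for m in MODALITIES}

# Learnable induction heads on the RGB branch: one projection per modality
# that tries to reproduce each prototype from RGB alone (the auxiliary task).
induction_heads = {m: rng.standard_normal((D_RGB, D_PROTO)) * 0.01 for m in MODALITIES}

def induction_loss(rgb_feats):
    """Mean-squared alignment between induced and pseudo-modal prototypes."""
    loss = 0.0
    for m in MODALITIES:
        target = rgb_feats @ pseudo_backbones[m]   # training-only teacher
        induced = rgb_feats @ induction_heads[m]   # RGB-only student
        loss += np.mean((induced - target) ** 2)
    return loss / len(MODALITIES)

def train_step(rgb_feats, lr=1e-3):
    """One gradient-descent step on each head (closed-form MSE gradient)."""
    for m in MODALITIES:
        target = rgb_feats @ pseudo_backbones[m]
        induced = rgb_feats @ induction_heads[m]
        grad = 2 * rgb_feats.T @ (induced - target) / (rgb_feats.shape[0] * D_PROTO)
        induction_heads[m] -= lr * grad / len(MODALITIES)

def inference_score(rgb_feats):
    """At test time only RGB and the trained heads are used -- no backbones."""
    induced = np.concatenate([rgb_feats @ induction_heads[m] for m in MODALITIES], axis=1)
    return np.linalg.norm(induced, axis=1)  # placeholder per-snippet anomaly score

rgb = rng.standard_normal((T, D_RGB))
before = induction_loss(rgb)
for _ in range(200):
    train_step(rgb)
after = induction_loss(rgb)
```

After training, `inference_score` produces one score per temporal snippet from RGB features alone, which is the property the abstract highlights: the five modality backbones are discarded at inference.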