🤖 AI Summary
Existing video anomaly detection methods are largely semi-automated, relying on manual evaluation and yielding only binary decisions—lacking fine-grained spatial localization and interactive interpretability. To address these limitations, we propose a text-guided fine-grained video anomaly detection framework, centered on an Anomaly Heatmap Decoder and a Region-aware Anomaly Encoder. These components enable pixel-level vision–language alignment and learnable textual embeddings, supporting both anomaly heatmaps and natural-language descriptions. Our method leverages a Large Vision–Language Model (LVLM), integrating visual feature decoding, region-aware encoding, and multimodal alignment. Evaluated on the UBnormal dataset, it achieves a micro-AUC of 94.8%, with heatmap localization accuracy of 67.8% (RBDC) and 76.7% (TBDC). Moreover, generated textual descriptions significantly outperform those of prior approaches in fidelity and semantic relevance.
📝 Abstract
Video Anomaly Detection (VAD) aims to identify anomalous events within video segments. In scenarios such as surveillance or industrial process monitoring, anomaly detection is of critical importance. However, existing approaches are largely semi-automated, requiring human assessment of detected anomalies, and traditional VAD methods produce only a binary output: normal or anomalous. We propose Text-guided Fine-Grained Video Anomaly Detection (T-VAD), a framework built upon a Large Vision-Language Model (LVLM). T-VAD introduces an Anomaly Heatmap Decoder (AHD) that performs pixel-wise visual-textual feature alignment to generate fine-grained anomaly heatmaps. Furthermore, we design a Region-aware Anomaly Encoder (RAE) that transforms the heatmaps into learnable textual embeddings, guiding the LVLM to accurately identify and localize anomalous events in videos. This significantly enhances both the granularity and interactivity of anomaly detection. The proposed method achieves SOTA performance: 94.8% Area Under the Curve (AUC, specifically micro-AUC) and 67.8%/76.7% anomaly-heatmap localization accuracy (RBDC/TBDC) on the UBnormal dataset, along with subjectively preferred textual descriptions on a ShanghaiTech-based dataset (BLEU-4: 62.67 for targets, 88.84 for trajectories; Yes/No accuracy: 97.67%) and on the UBnormal dataset (BLEU-4: 50.32 for targets, 78.10 for trajectories; Yes/No accuracy: 89.73%).
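The abstract does not spell out how the AHD's pixel-wise visual-textual alignment is computed. As a rough, hypothetical sketch of the general idea (not the paper's actual implementation), one can score each spatial visual feature against a text embedding with cosine similarity and rescale the result into a [0, 1] heatmap; the function name, shapes, and normalization below are illustrative assumptions:

```python
import numpy as np

def anomaly_heatmap(pixel_feats: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Illustrative pixel-wise visual-textual alignment (not the paper's AHD).

    pixel_feats: (H, W, D) per-pixel visual features.
    text_emb:    (D,) embedding of an anomaly-describing text prompt.
    Returns an (H, W) heatmap in [0, 1].
    """
    # L2-normalize both modalities so the dot product is cosine similarity.
    v = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    sim = v @ t                # (H, W) cosine similarity in [-1, 1]
    return (sim + 1.0) / 2.0   # rescale to [0, 1] as an anomaly heatmap

# Toy usage with random features standing in for LVLM outputs.
rng = np.random.default_rng(0)
heat = anomaly_heatmap(rng.normal(size=(8, 8, 16)), rng.normal(size=16))
```

In practice such a map would be thresholded or post-processed before region-level scoring (e.g. RBDC/TBDC evaluation); the details of that step in T-VAD are not given in the abstract.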