Text-guided Fine-Grained Video Anomaly Detection

📅 2025-11-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video anomaly detection methods are largely semi-automated, relying on manual evaluation and yielding only binary decisions; they lack fine-grained spatial localization and interactive interpretability. To address these limitations, we propose a text-guided fine-grained video anomaly detection framework, centered on an Anomaly Heatmap Decoder and a Region-aware Anomaly Encoder. These components enable pixel-level vision–language alignment and learnable textual embeddings, supporting both anomaly heatmaps and natural-language descriptions. Our method leverages a Large Vision–Language Model (LVLM), integrating visual feature decoding, region-aware encoding, and multimodal alignment. Evaluated on the UBnormal dataset, it achieves a micro-AUC of 94.8%, with heatmap localization scores of 67.8% (RBDC) and 76.7% (TBDC). Moreover, the generated textual descriptions significantly outperform those of prior approaches in fidelity and semantic relevance.

📝 Abstract
Video Anomaly Detection (VAD) aims to identify anomalous events within video segments. In scenarios such as surveillance or industrial process monitoring, anomaly detection is of critical importance. Existing approaches remain semi-automated, requiring human assessment, and traditional VAD methods output only a binary normal-versus-anomalous decision. We propose Text-guided Fine-Grained Video Anomaly Detection (T-VAD), a framework built upon a Large Vision-Language Model (LVLM). T-VAD introduces an Anomaly Heatmap Decoder (AHD) that performs pixel-wise visual-textual feature alignment to generate fine-grained anomaly heatmaps. Furthermore, we design a Region-aware Anomaly Encoder (RAE) that transforms the heatmaps into learnable textual embeddings, guiding the LVLM to accurately identify and localize anomalous events in videos. This significantly enhances both the granularity and interactivity of anomaly detection. The proposed method achieves SOTA performance, with 94.8% Area Under the Curve (AUC, specifically micro-AUC) and 67.8%/76.7% anomaly-heatmap accuracy (RBDC/TBDC) on the UBnormal dataset; its textual descriptions were also subjectively judged preferable on a ShanghaiTech-based dataset (BLEU-4: 62.67 for targets, 88.84 for trajectories; Yes/No accuracy: 97.67%) and on the UBnormal dataset (BLEU-4: 50.32 for targets, 78.10 for trajectories; Yes/No accuracy: 89.73%).
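The pixel-wise visual-textual alignment behind an Anomaly Heatmap Decoder can be illustrated with a minimal sketch: each spatial position's visual feature is compared with the embedding of an anomaly text prompt via cosine similarity, and the similarity map is squashed into a heatmap. All shapes, names, and the scoring function here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def anomaly_heatmap(visual_feats: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Toy pixel-wise vision-language alignment (illustrative only).

    visual_feats: (H, W, D) per-pixel visual features.
    text_emb:     (D,) embedding of an anomaly text prompt.
    Returns an (H, W) heatmap with values in [0, 1].
    """
    # L2-normalize both sides so the dot product is cosine similarity.
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    sim = v @ t                          # (H, W) cosine similarity per pixel
    return 1.0 / (1.0 + np.exp(-sim))    # sigmoid -> per-pixel anomaly score

# Usage: random features, with one region biased toward the prompt embedding.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 16))
prompt = rng.normal(size=16)
feats[2:4, 2:4] += 3.0 * prompt          # plant a synthetic "anomalous" region
heat = anomaly_heatmap(feats, prompt)    # planted region scores highest
```

The planted region ends up with the highest scores because its features correlate with the prompt embedding; in the real framework, such a map would come from learned, aligned features rather than synthetic bias.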
Problem

Research questions and friction points this paper is trying to address.

Detecting anomalous events in video segments automatically
Generating fine-grained anomaly heatmaps using text guidance
Enhancing anomaly localization accuracy with visual-textual alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Large Vision-Language Model for video anomaly detection
Introduces Anomaly Heatmap Decoder for pixel-wise feature alignment
Designs Region-aware Anomaly Encoder for learnable textual embeddings
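The last idea above, converting a heatmap into embeddings the language model can consume, can be sketched as heatmap-weighted pooling: the heatmap weights each pixel's visual feature, the pooled vectors summarize anomalous and background regions, and a projection maps them into the LVLM's token space. The pooling scheme, token count, and projection are assumptions for illustration, not the paper's RAE design.

```python
import numpy as np

def region_tokens(visual_feats: np.ndarray, heatmap: np.ndarray,
                  proj: np.ndarray) -> np.ndarray:
    """Toy region-aware encoding: heatmap-weighted pooling -> LVLM-space tokens.

    visual_feats: (H, W, D) per-pixel visual features.
    heatmap:      (H, W) anomaly scores in [0, 1].
    proj:         (D, E) projection into the language model's embedding space.
    Returns (2, E): one token for the anomalous region, one for the background.
    """
    w = heatmap / (heatmap.sum() + 1e-8)                  # anomaly weights
    bg = (1.0 - heatmap) / ((1.0 - heatmap).sum() + 1e-8)  # background weights
    pooled = np.stack([
        (visual_feats * w[..., None]).sum(axis=(0, 1)),    # anomalous-region vector
        (visual_feats * bg[..., None]).sum(axis=(0, 1)),   # background vector
    ])                                                     # (2, D)
    return pooled @ proj                                   # (2, E) tokens

# Usage with synthetic inputs and a random projection matrix.
rng = np.random.default_rng(1)
feats = rng.normal(size=(8, 8, 16))
heat = np.zeros((8, 8))
heat[2:4, 2:4] = 0.9                  # pretend the decoder flagged this region
W = rng.normal(size=(16, 32))
tokens = region_tokens(feats, heat, W)
```

In the actual framework, such tokens would be learned jointly with the LVLM so that they steer it toward describing and localizing the flagged region.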