Flashback: Memory-Driven Zero-shot, Real-time Video Anomaly Detection

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video anomaly detection (VAD) faces dual challenges of real-time inference and cross-domain generalization under zero-shot settings. This paper proposes a “Recall–Respond” two-stage paradigm: offline, a multimodal model extracts normal video features, while an LLM generates pseudo-anomalous semantic descriptions to construct a retrievable hybrid-scene memory bank; online, lightweight embeddings and FAISS-based vector retrieval enable millisecond-scale zero-shot anomaly classification—requiring no real anomalous samples, no LLM invocation, and no gradient-based optimization. To our knowledge, this is the first fully zero-shot, purely lightweight, and real-time VAD framework, achieving end-to-end millisecond latency on consumer-grade GPUs. It attains 87.3 AUC (+7.0) on UCF-Crime and 75.1 AP (+13.1) on XD-Violence, significantly outperforming existing zero-shot VAD methods.
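The Recall–Respond flow described above can be sketched in a few lines. This is a minimal illustration only: the paper uses a multimodal video/text encoder and a FAISS index, while the sketch below substitutes a hashed bag-of-words stub embedder and a brute-force NumPy cosine search; the captions are hand-written placeholders for the LLM-generated pseudo-scene memory.

```python
import numpy as np

def embed(text, dim=64):
    """Stub embedder: hashed bag-of-words, L2-normalized.
    Stands in for the paper's multimodal encoder; only the retrieval
    flow around it is representative."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# Recall (offline): build a memory of normal and pseudo-anomalous captions.
# In the paper these come from an off-the-shelf LLM; these are placeholders.
memory = [
    ("a person walks along a sidewalk", 0),        # 0 = normal
    ("cars drive down the street in traffic", 0),
    ("a person smashes a car window", 1),          # 1 = anomalous
    ("two people fight on the street", 1),
]
bank = np.stack([embed(c) for c, _ in memory])     # FAISS index stand-in
labels = np.array([lab for _, lab in memory])

# Respond (online): embed an incoming segment and retrieve the nearest
# memory entry; its label is the zero-shot prediction. No LLM call,
# no gradient step -- just one embedding and one similarity search.
def classify(segment_caption):
    sims = bank @ embed(segment_caption)           # cosine similarities
    return int(labels[int(np.argmax(sims))])

print(classify("a person smashes a window of a parked car"))  # -> 1
```

Because the memory is frozen after the offline stage, the online cost is a single encoder forward pass plus one vector search, which is what makes the framework real-time and training-free.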

📝 Abstract
Video Anomaly Detection (VAD) automatically identifies anomalous events from video, mitigating the need for human operators in large-scale surveillance deployments. However, two fundamental obstacles hinder real-world adoption: domain dependency, and real-time constraints requiring near-instantaneous processing of incoming video. To this end, we propose Flashback, a zero-shot and real-time video anomaly detection paradigm. Inspired by the human cognitive mechanism of instantly judging anomalies in the current scene by reasoning over past experience, Flashback operates in two stages: Recall and Respond. In the offline recall stage, an off-the-shelf LLM builds a pseudo-scene memory of both normal and anomalous captions, without any reliance on real anomaly data. In the online respond stage, incoming video segments are embedded and matched against this memory via similarity search. By eliminating all LLM calls at inference time, Flashback delivers real-time VAD even on a consumer-grade GPU. On two large datasets from real-world surveillance scenarios, UCF-Crime and XD-Violence, we achieve 87.3 AUC (+7.0 pp) and 75.1 AP (+13.1 pp), respectively, outperforming prior zero-shot VAD methods by large margins.
Problem

Research questions and friction points this paper is trying to address.

Real-time video anomaly detection under domain constraints
Zero-shot learning without real anomaly data
Efficient inference without LLM calls during operation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLM to build pseudo-scene memory offline
Matches video segments via similarity search online
Eliminates LLM calls for real-time GPU processing
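The real-time claim in the last bullet rests on the respond stage reducing to one embedding plus one vector search over the memory bank. A rough sense of that cost, using a synthetic random bank and brute-force NumPy search in place of FAISS (bank size and dimension are illustrative assumptions, not the paper's configuration):

```python
import time
import numpy as np

rng = np.random.default_rng(0)

# Synthetic memory bank: 10k caption embeddings of dimension 512,
# L2-normalized so a dot product equals cosine similarity.
bank = rng.standard_normal((10_000, 512)).astype(np.float32)
bank /= np.linalg.norm(bank, axis=1, keepdims=True)

# A query that is a lightly perturbed copy of row 123, mimicking a
# segment embedding close to one stored scene.
query = bank[123] + 0.01 * rng.standard_normal(512).astype(np.float32)
query /= np.linalg.norm(query)

t0 = time.perf_counter()
nearest = int(np.argmax(bank @ query))      # brute-force cosine search
dt_ms = (time.perf_counter() - t0) * 1e3

print(nearest)   # 123: the perturbed source row is its own nearest neighbor
```

Even this unoptimized matrix-vector product completes in milliseconds on a CPU; a FAISS index on a consumer GPU, as used in the paper, only lowers that further, which is what removes the LLM from the critical path.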