🤖 AI Summary
Video anomaly detection (VAD) faces dual challenges of real-time inference and cross-domain generalization under zero-shot settings. This paper proposes a “Recall–Respond” two-stage paradigm: offline, a multimodal model extracts normal video features, while an LLM generates pseudo-anomalous semantic descriptions to construct a retrievable hybrid-scene memory bank; online, lightweight embeddings and FAISS-based vector retrieval enable millisecond-scale zero-shot anomaly classification. The online stage requires no real anomalous samples, no LLM invocation, and no gradient-based optimization. To our knowledge, this is the first fully zero-shot, purely lightweight, and real-time VAD framework, achieving end-to-end millisecond latency on consumer-grade GPUs. It attains 87.3 AUC (+7.0 pp) on UCF-Crime and 75.1 AP (+13.1 pp) on XD-Violence, significantly outperforming existing zero-shot VAD methods.
📝 Abstract
Video Anomaly Detection (VAD) automatically identifies anomalous events from video, mitigating the need for human operators in large-scale surveillance deployments. However, two fundamental obstacles hinder real-world adoption: domain dependency and real-time constraints, the latter requiring near-instantaneous processing of incoming video. To this end, we propose Flashback, a zero-shot and real-time video anomaly detection paradigm. Inspired by the human cognitive mechanism of instantly judging anomalies and reasoning in current scenes based on past experience, Flashback operates in two stages: Recall and Respond. In the offline recall stage, an off-the-shelf LLM builds a pseudo-scene memory of both normal and anomalous captions without any reliance on real anomaly data. In the online respond stage, incoming video segments are embedded and matched against this memory via similarity search. By eliminating all LLM calls at inference time, Flashback delivers real-time VAD even on a consumer-grade GPU. On two large datasets from real-world surveillance scenarios, UCF-Crime and XD-Violence, we achieve 87.3 AUC (+7.0 pp) and 75.1 AP (+13.1 pp), respectively, outperforming prior zero-shot VAD methods by large margins.
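The respond stage described above (embed a segment, then match it against a labeled caption memory by similarity search) can be illustrated with a minimal sketch. This is not the paper's implementation. The function names `normalize`, `build_memory`, and `anomaly_score`, the top-k voting rule, and the hand-written 3-dimensional vectors are all assumptions made for illustration. A real system would obtain embeddings from a pretrained encoder and would likely use a vector index such as FAISS instead of the brute-force scan shown here.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_memory(entries):
    """Offline step (hypothetical): store unit-length caption embeddings
    together with a flag saying whether the caption describes an anomaly."""
    return [(normalize(emb), bool(is_anomalous)) for emb, is_anomalous in entries]


def anomaly_score(segment_embedding, memory, k=3):
    """Online step (hypothetical): fraction of the k most similar memory
    entries that are labeled anomalous. No LLM call and no training."""
    if not memory:
        raise ValueError("memory must contain at least one entry")
    query = normalize(segment_embedding)
    # Brute-force cosine similarity against every memory entry.
    sims = [
        (sum(q * m for q, m in zip(query, emb)), is_anomalous)
        for emb, is_anomalous in memory
    ]
    top = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(1 for _, is_anomalous in top if is_anomalous) / len(top)


# Toy memory: made-up vectors standing in for caption embeddings.
memory = build_memory([
    ([1.0, 0.1, 0.0], False),  # stands in for a normal caption
    ([0.9, 0.2, 0.1], False),
    ([0.8, 0.0, 0.2], False),
    ([0.0, 1.0, 0.1], True),   # stands in for an anomalous caption
    ([0.1, 0.9, 0.0], True),
    ([0.0, 0.8, 0.3], True),
])

normal_like = anomaly_score([0.95, 0.1, 0.05], memory, k=3)
anomalous_like = anomaly_score([0.05, 0.95, 0.1], memory, k=3)
```

In this toy setup a segment embedding close to the normal entries scores low and one close to the anomalous entries scores high. The per-query cost is one similarity scan, which is the property that lets the online stage avoid LLM calls at inference time.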