ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions

📅 2026-04-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses three challenges in open-world video anomaly detection (low model efficiency, lack of streaming capability, and inflexible anomaly definitions) by proposing the first training-free streaming detection framework. The method combines structured user prompts, intra-frame token merging to compress visual redundancy, a hybrid streaming memory mechanism for causal reasoning, and a definition normalization technique that adapts in real time to dynamically specified anomaly semantics. Built on a multimodal large language model, the system achieves real-time inference on a single GPU and sets a new state of the art in anomaly temporal localization, classification, and natural-language description generation. To foster further research in this direction, the authors also introduce OpenDef-Bench, a new benchmark for open-world, definition-driven anomaly detection.
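To give a rough feel for the hybrid streaming memory idea mentioned above, the sketch below keeps a short-term window of recent frame features alongside a running long-term summary of evicted frames. The window/summary split and the mean-pooling compression are illustrative assumptions, not ESOM's actual mechanism.

```python
from collections import deque

class HybridStreamingMemory:
    """Toy hybrid memory: a short-term window of recent frame features
    plus a long-term summary (running element-wise mean of evicted
    frames). Illustrative assumption, not ESOM's exact design."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.long_term = None
        self.evicted = 0

    def add(self, frame_feat):
        if len(self.window) == self.window.maxlen:
            old = self.window[0]  # will be evicted by the append below
            # Incrementally fold the evicted frame into the running mean.
            if self.long_term is None:
                self.long_term = list(old)
            else:
                n = self.evicted
                self.long_term = [(lt * n + x) / (n + 1)
                                  for lt, x in zip(self.long_term, old)]
            self.evicted += 1
        self.window.append(frame_feat)

    def context(self):
        # Causal context: compressed past plus recent frames only.
        return (self.long_term, list(self.window))

# Stream three 2-D frame features through a window of size 2.
mem = HybridStreamingMemory(window_size=2)
for feat in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    mem.add(feat)
long_term, recent = mem.context()
```

The point of the split is that per-frame cost stays constant: the language model only ever sees a fixed-size recent window plus a compact summary, no matter how long the stream runs.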
📝 Abstract
Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.
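The abstract's Probabilistic Scoring module converts interval-level textual outputs into frame-level anomaly scores. As a minimal sketch of what such a step could look like, the code below assumes the model emits (start, end, confidence) intervals and scores each frame by the maximum confidence among intervals covering it; both the interval format and the max-aggregation rule are assumptions for illustration, not the paper's exact formulation.

```python
def intervals_to_frame_scores(intervals, num_frames):
    """Convert interval-level predictions to frame-level anomaly scores.

    intervals: list of (start, end, confidence) tuples, with frame
               indices inclusive and confidence in [0, 1].
    Returns per-frame scores: each frame takes the maximum confidence
    among the intervals covering it, or 0.0 if none do.
    """
    scores = [0.0] * num_frames
    for start, end, conf in intervals:
        # Clamp the interval to the valid frame range.
        for f in range(max(0, start), min(num_frames - 1, end) + 1):
            scores[f] = max(scores[f], conf)
    return scores

# Example: two overlapping anomaly intervals over a 10-frame clip.
frame_scores = intervals_to_frame_scores([(2, 5, 0.8), (4, 7, 0.6)], 10)
```

Frame-level scores are what standard anomaly-localization metrics (e.g. frame-level AUC) consume, which is why a conversion step like this is needed at all.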
Problem

Research questions and friction points this paper is trying to address.

open-world video anomaly detection
streaming video
dynamic anomaly definitions
real-time efficiency
anomaly localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming Video Anomaly Detection
Open-world Learning
Training-free Model
Token Merging
Dynamic Anomaly Definition
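To make the token-merging contribution concrete: a common training-free approach (in the spirit of ToMe) reduces visual token count by repeatedly merging the most similar pair of tokens into their average. The sketch below is a simplified greedy single-frame variant using cosine similarity; ESOM's inter-frame-matched intra-frame merging is more involved, so treat this purely as an illustration of the general technique.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_tokens(tokens, num_merges):
    """Greedily merge the most cosine-similar token pair, num_merges times.

    tokens: list of feature vectors (lists of floats).
    Each merge replaces the chosen pair with its element-wise mean,
    shrinking the token count by one per step.
    """
    tokens = [list(t) for t in tokens]
    for _ in range(num_merges):
        best = None
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                sim = cosine(tokens[i], tokens[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
        tokens.pop(j)  # remove j first so index i stays valid
        tokens.pop(i)
        tokens.append(merged)
    return tokens

# Example: four 2-D tokens; the two near-duplicates get merged first.
reduced = merge_tokens([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.5, 0.5]], 2)
```

Since MLLM inference cost grows with the number of visual tokens per frame, even a modest merge ratio translates directly into latency savings, which is what makes this attractive for real-time streaming.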
👥 Authors
Zihao Liu
State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China
Xiaoyu Wu
Central University of Finance and Economics
development economics, labor economics, health economics
Wenna Li
State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China
Jianqin Wu
State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China
Linlin Yang
Communication University of China
Computer Vision, Machine Learning