ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions

📅 2026-04-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses three challenges in open-world video anomaly detection (low model efficiency, lack of streaming capability, and inflexible anomaly definitions) by proposing the first training-free streaming detection framework. The method combines structured user prompts, intra-frame token merging to compress visual redundancy, a hybrid streaming memory mechanism for causal reasoning, and a definition normalization technique that adapts in real time to dynamically specified anomaly semantics. Built on a multimodal large language model, the system achieves real-time inference on a single GPU and sets a new state of the art in anomaly temporal localization, classification, and natural-language description generation. To foster further research in this direction, the authors also introduce OpenDef-Bench, a new benchmark for open-world, definition-driven anomaly detection.
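To give a rough feel for the hybrid streaming memory idea mentioned above, the sketch below keeps a short-term window of recent frame features alongside a running long-term summary of evicted frames. The window/summary split and the mean-pooling compression are illustrative assumptions, not ESOM's actual mechanism.

```python
from collections import deque

class HybridStreamingMemory:
    """Toy hybrid memory: a short-term window of recent frame features
    plus a long-term summary (running element-wise mean of evicted
    frames). Illustrative assumption, not ESOM's exact design."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.long_term = None
        self.evicted = 0

    def add(self, frame_feat):
        if len(self.window) == self.window.maxlen:
            old = self.window[0]  # will be evicted by the append below
            # Incrementally fold the evicted frame into the running mean.
            if self.long_term is None:
                self.long_term = list(old)
            else:
                n = self.evicted
                self.long_term = [(lt * n + x) / (n + 1)
                                  for lt, x in zip(self.long_term, old)]
            self.evicted += 1
        self.window.append(frame_feat)

    def context(self):
        # Causal context: compressed past plus recent frames only.
        return (self.long_term, list(self.window))

# Stream three 2-D frame features through a window of size 2.
mem = HybridStreamingMemory(window_size=2)
for feat in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    mem.add(feat)
long_term, recent = mem.context()
```

The point of the split is that per-frame cost stays constant: the language model only ever sees a fixed-size recent window plus a compact summary, no matter how long the stream runs.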
📝 Abstract
Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.
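The abstract's Probabilistic Scoring module converts interval-level textual outputs into frame-level anomaly scores. As a minimal sketch of what such a step could look like, the code below assumes the model emits (start, end, confidence) intervals and scores each frame by the maximum confidence among intervals covering it; both the interval format and the max-aggregation rule are assumptions for illustration, not the paper's exact formulation.

```python
def intervals_to_frame_scores(intervals, num_frames):
    """Convert interval-level predictions to frame-level anomaly scores.

    intervals: list of (start, end, confidence) tuples, with frame
               indices inclusive and confidence in [0, 1].
    Returns per-frame scores: each frame takes the maximum confidence
    among the intervals covering it, or 0.0 if none do.
    """
    scores = [0.0] * num_frames
    for start, end, conf in intervals:
        # Clamp the interval to the valid frame range.
        for f in range(max(0, start), min(num_frames - 1, end) + 1):
            scores[f] = max(scores[f], conf)
    return scores

# Example: two overlapping anomaly intervals over a 10-frame clip.
frame_scores = intervals_to_frame_scores([(2, 5, 0.8), (4, 7, 0.6)], 10)
```

Frame-level scores are what standard anomaly-localization metrics (e.g. frame-level AUC) consume, which is why a conversion step like this is needed at all.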
Problem

Research questions and friction points this paper is trying to address.

open-world video anomaly detection
streaming video
dynamic anomaly definitions
real-time efficiency
anomaly localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming Video Anomaly Detection
Open-world Learning
Training-free Model
Token Merging
Dynamic Anomaly Definition
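To make the token-merging contribution concrete: a common training-free approach (in the spirit of ToMe) reduces visual token count by repeatedly merging the most similar pair of tokens into their average. The sketch below is a simplified greedy single-frame variant using cosine similarity; ESOM's inter-frame-matched intra-frame merging is more involved, so treat this purely as an illustration of the general technique.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_tokens(tokens, num_merges):
    """Greedily merge the most cosine-similar token pair, num_merges times.

    tokens: list of feature vectors (lists of floats).
    Each merge replaces the chosen pair with its element-wise mean,
    shrinking the token count by one per step.
    """
    tokens = [list(t) for t in tokens]
    for _ in range(num_merges):
        best = None
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                sim = cosine(tokens[i], tokens[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
        tokens.pop(j)  # remove j first so index i stays valid
        tokens.pop(i)
        tokens.append(merged)
    return tokens

# Example: four 2-D tokens; the two near-duplicates get merged first.
reduced = merge_tokens([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.5, 0.5]], 2)
```

Since MLLM inference cost grows with the number of visual tokens per frame, even a modest merge ratio translates directly into latency savings, which is what makes this attractive for real-time streaming.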
👥 Authors
Zihao Liu
State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China
Xiaoyu Wu
Central University of Finance and Economics
development economics, labor economics, health economics
Wenna Li
State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China
Jianqin Wu
State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China
Linlin Yang
Communication University of China
Computer Vision, Machine Learning