AI Summary
Current video anomaly detection (VAD) methods operate solely at the frame level, lacking the capability to model and spatiotemporally localize structured semantic elements of anomalous events, such as subjects, event types, objects, and scenes. To address this limitation, we introduce the novel task of Multi-scene Video Abnormal Event Extraction and Localization (M-VAE). We construct the first M-VAE instruction-tuning dataset and propose a global-local spatially aware Video Large Language Model (Video-LLM). Our architecture incorporates a Global-local Spatial-enhanced MoE (GSM) module and a Spatial Imbalance Regulator (SIR), enabling fine-grained semantic parsing and precise spatiotemporal localization. Experiments demonstrate that our method achieves 12.7%–18.3% improvements over state-of-the-art Video-LLMs in both quadruple extraction accuracy and spatiotemporal localization precision, significantly advancing the structured semantic understanding of video anomalies.
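To make the task output concrete, the sketch below models an extracted anomalous-event record as a plain Python dataclass: the four semantic fields of the quadruple (subject, event type, object, scene) plus hypothetical spatiotemporal localization fields. The field names `frame_span` and `boxes` are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AnomalousEvent:
    # The four semantic fields of the M-VAE quadruple.
    subject: str
    event_type: str
    object_: str      # trailing underscore avoids shadowing the builtin
    scene: str
    # Hypothetical localization fields (illustrative, not the paper's schema):
    frame_span: Tuple[int, int]             # (start_frame, end_frame)
    boxes: List[Tuple[int, int, int, int]]  # (x1, y1, x2, y2) per frame

# A toy example of one extracted event.
event = AnomalousEvent(
    subject="person",
    event_type="fighting",
    object_="pedestrian",
    scene="street",
    frame_span=(120, 240),
    boxes=[(34, 50, 180, 220)],
)
print(event.event_type)  # fighting
```

A structured record like this is what distinguishes M-VAE from frame-level VAD, which would only emit a per-frame normal/abnormal score.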
Abstract
Prior studies on Video Anomaly Detection (VAD) mainly focus on detecting whether each video frame is abnormal, which largely ignores the structured semantic information of the video (i.e., what, when, and where the abnormal event happens). With this in mind, we propose a new chat-paradigm Multi-scene Video Abnormal Event Extraction and Localization (M-VAE) task, aiming to extract the abnormal event quadruples (i.e., subject, event type, object, scene) and localize such events. Further, we argue that this new task faces two key challenges, i.e., global-local spatial modeling and global-local spatial balancing. To this end, we propose a Global-local Spatial-sensitive Large Language Model (LLM) named Sherlock, i.e., acting like Sherlock Holmes to track down criminal events, for this M-VAE task. Specifically, this model designs a Global-local Spatial-enhanced MoE (GSM) module and a Spatial Imbalance Regulator (SIR) to address the two challenges respectively. Extensive experiments on our M-VAE instruction dataset show the significant advantages of Sherlock over several advanced Video-LLMs. This justifies the importance of global-local spatial information for the M-VAE task and the effectiveness of Sherlock in capturing such information.
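The GSM module builds on the mixture-of-experts (MoE) idea. The abstract gives no internal details, so the sketch below is a generic top-k MoE forward pass, not the paper's GSM: a gating network scores each expert, the top-k experts are run, and their outputs are mixed by renormalized gate probabilities. All names and the toy "global"/"local" experts are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(x, experts, gate_weights, top_k=2):
    """Generic top-k mixture-of-experts (illustrative only, not GSM):
    route input x to the k highest-scoring experts and mix their
    outputs by renormalized gate probabilities."""
    # Linear gating: one score per expert.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        w = probs[i] / norm
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Two toy experts standing in for "global" and "local" spatial processing.
experts = [lambda x: [2.0 * v for v in x],   # hypothetical global expert
           lambda x: [v + 1.0 for v in x]]   # hypothetical local expert
gate = [[1.0, 0.0], [0.0, 1.0]]              # hypothetical gating matrix
print(moe_forward([0.5, -0.5], experts, gate, top_k=1))  # [1.0, -1.0]
```

With `top_k=1` the gate picks the first expert for this input, so the output is just that expert's result; with `top_k=2` the two expert outputs are blended by the gate probabilities, which is the mechanism a spatial-enhanced MoE would use to balance global and local features.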