Aligning Effective Tokens with Video Anomaly in Large Language Models

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video-understanding multimodal large language models (MLLMs) perform poorly on anomalous events, primarily because anomalies are spatially and temporally sparse and redundant video content interferes with extracting the critical features. To address this, the paper proposes VA-GPT, an MLLM for summarizing and localizing abnormal events that aligns effective tokens between the visual encoder and the LLM through two modules: Spatial Effective Token Selection (SETS), which captures local anomaly details, and Temporal Effective Token Generation (TETG), which models their dynamic evolution. The authors further construct an instruction-following dataset for fine-tuning video-anomaly-aware MLLMs and introduce a cross-domain evaluation benchmark based on the XD-Violence dataset. Experiments show the method outperforms state-of-the-art approaches on various benchmarks.

📝 Abstract
Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where redundant information often leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vision Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on the XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks.
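The core idea behind SETS, as the abstract describes it, is to keep only the visual tokens relevant to the sparse anomaly and discard redundant ones before they reach the LLM. A minimal sketch of such top-k token selection is below; the `spatial_token_selection` helper, the scoring mechanism, and the array shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def spatial_token_selection(tokens, scores, k):
    """Keep the k patch tokens with the highest anomaly-relevance scores.

    tokens: (N, D) array of patch embeddings from a visual encoder
    scores: (N,) hypothetical relevance scores for each patch
    """
    idx = np.argsort(scores)[::-1][:k]  # indices of the top-k scores
    return tokens[np.sort(idx)]         # preserve original spatial order

# Toy example: 6 patch tokens of dimension 4; keep the 2 most relevant.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.3])
selected = spatial_token_selection(tokens, scores, k=2)
print(selected.shape)  # (2, 4)
```

Keeping the surviving tokens in their original spatial order (rather than score order) matters in this kind of scheme, since positional structure is what lets the language model reason about where the anomaly occurs.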
Problem

Research questions and friction points this paper is trying to address.

Detecting video anomalies with sparse spatial-temporal events
Aligning visual tokens with language models for anomaly analysis
Improving accuracy in summarizing and localizing abnormal events
Innovation

Methods, ideas, or system contributions that make the work stand out.

VA-GPT aligns tokens via SETS and TETG
Spatial and temporal modules capture abnormal-event cues
Cross-domain evaluation benchmark built on the XD-Violence dataset
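TETG, per the abstract, generates tokens that summarize the temporal evolution of an anomaly. One plausible reading is score-weighted pooling over temporal windows, so that frames likely to contain the anomaly dominate each generated token. The sketch below is a hypothetical illustration of that idea, not the paper's method; the windowing scheme and softmax weighting are assumptions.

```python
import numpy as np

def temporal_token_generation(frame_tokens, frame_scores, window):
    """Pool each window of frame embeddings into one token, weighting
    frames by a (hypothetical) anomaly score so abnormal moments dominate.

    frame_tokens: (T, D) per-frame embeddings
    frame_scores: (T,) anomaly scores, higher = more anomalous
    """
    out = []
    for start in range(0, len(frame_tokens), window):
        f = frame_tokens[start:start + window]
        s = frame_scores[start:start + window]
        w = np.exp(s) / np.exp(s).sum()  # softmax weights within the window
        out.append(w @ f)                # weighted-average token
    return np.stack(out)

# Toy example: 8 frames of dimension 4, pooled into 2 temporal tokens.
rng = np.random.default_rng(1)
toks = rng.normal(size=(8, 4))
scores = rng.uniform(size=8)
temporal = temporal_token_generation(toks, scores, window=4)
print(temporal.shape)  # (2, 4)
```

This kind of pooling compresses a long, mostly redundant video into a handful of anomaly-biased tokens, which is consistent with the paper's stated goal of aligning effective tokens between the visual encoder and the LLM.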