🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit limited capability in understanding anomalies in surveillance videos, and no dedicated benchmark supports open-ended question-answering (QA) evaluation for this task. To address this gap, we introduce UCVL, the first MLLM-oriented benchmark for criminal surveillance video analysis, comprising 1,829 real-world videos with meticulously re-annotated, multi-granularity anomaly labels and six categories of open-ended QA pairs. We also propose an automatic, fine-grained framework that uses GPT-4o to evaluate free-form textual responses. Using UCVL, we systematically evaluate eight state-of-the-art MLLMs spanning 0.5B to 40B parameters. Furthermore, fine-tuning LLaVA-OneVision on UCVL yields significant performance gains (+12.3% average accuracy), demonstrating the benchmark's high quality, practical utility, and generalizability. UCVL fills a critical void in evaluating large models for video anomaly understanding.
📝 Abstract
Anomaly analysis in surveillance videos is a crucial topic in computer vision. In recent years, multimodal large language models (MLLMs) have outperformed task-specific models in various domains. Although MLLMs are particularly versatile, their ability to understand anomalous concepts and details remains insufficiently studied, because existing benchmarks in this field are outdated: they provide neither MLLM-style QA pairs nor efficient methods for assessing a model's open-ended text responses. To fill this gap, we propose UCVL, a benchmark for crime surveillance video analysis with large models, comprising 1,829 videos and reorganized annotations from the UCF-Crime and UCF-Crime Annotation datasets. We design six types of questions and generate diverse QA pairs. We then develop detailed grading instructions and use OpenAI's GPT-4o for accurate assessment. We benchmark eight prevailing MLLMs ranging from 0.5B to 40B parameters, and the results demonstrate the reliability of this benchmark. Moreover, we finetune LLaVA-OneVision on UCVL's training set; the resulting improvement validates our data's high quality for video anomaly analysis.
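The GPT-4o-based assessment described above can be sketched as a judge prompt plus a score parser. The prompt wording, the 0-10 scale, and the helper names below are illustrative assumptions, not the paper's actual grading instructions; the commented-out call assumes the OpenAI Python SDK.

```python
import re


def build_grading_prompt(question: str, reference: str, answer: str) -> str:
    """Compose a grading instruction for the judge model.

    Hypothetical wording; the paper's actual detailed instructions
    are not reproduced here.
    """
    return (
        "You are grading an open-ended answer about a surveillance video.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Rate factual agreement with the reference on a 0-10 scale and "
        "reply in the form 'Score: <n>'."
    )


def parse_score(reply: str):
    """Extract the integer score from the judge's reply, or None if absent."""
    m = re.search(r"Score:\s*(\d+)", reply)
    return int(m.group(1)) if m else None


# Judge call (assumed OpenAI Python SDK; requires OPENAI_API_KEY):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[{"role": "user",
#                "content": build_grading_prompt(q, ref, ans)}],
# )
# score = parse_score(resp.choices[0].message.content)
```

A rubric-constrained reply format like `Score: <n>` keeps the judge's output machine-parseable, so thousands of open-ended answers can be scored without manual review.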