A Benchmark for Crime Surveillance Video Analysis with Large Models

📅 2025-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit limited capability in understanding anomalies in surveillance videos, and no dedicated benchmark supports open-ended question-answering (QA) evaluation for this task. To address this gap, we introduce UCVL—the first MLLM-oriented benchmark for criminal surveillance video analysis—comprising 1,829 real-world videos with meticulously re-annotated multi-granularity anomaly labels and six categories of open-ended QA pairs. We propose a GPT-4o-driven framework for automatic, fine-grained evaluation of textual responses. Using UCVL, we systematically evaluate eight state-of-the-art MLLMs spanning 0.5B to 40B parameters. Furthermore, fine-tuning LLaVA-OneVision on UCVL yields significant performance gains (+12.3% average accuracy), demonstrating the benchmark's high quality, practical utility, and generalizability. UCVL fills a critical void in evaluating large models for video anomaly understanding.

📝 Abstract
Anomaly analysis in surveillance videos is a crucial topic in computer vision. In recent years, multimodal large language models (MLLMs) have outperformed task-specific models in various domains. Although MLLMs are particularly versatile, their ability to understand anomalous concepts and details is insufficiently studied, because the outdated benchmarks in this field provide neither MLLM-style QAs nor efficient algorithms for assessing models' open-ended text responses. To fill this gap, we propose UCVL, a benchmark for crime surveillance video analysis with large models, comprising 1,829 videos and reorganized annotations from the UCF-Crime and UCF-Crime Annotation datasets. We design six types of questions and generate diverse QA pairs. We then develop detailed assessment instructions and use OpenAI's GPT-4o for accurate evaluation. We benchmark eight prevailing MLLMs ranging from 0.5B to 40B parameters, and the results demonstrate the reliability of this benchmark. Moreover, we finetune LLaVA-OneVision on UCVL's training set; the resulting improvement validates our data's high quality for video anomaly analysis.
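The GPT-4o-based assessment described in the abstract is an LLM-as-judge loop: the judge receives the question, a reference answer, and the model's open-ended response, and returns a rubric score. A minimal sketch, assuming a 0–5 rubric; the helper names (`build_judge_prompt`, `parse_score`) and the exact prompt wording are illustrative, not taken from the paper:

```python
import re

def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    """Assemble grading instructions for the judge model.

    Illustrative prompt only -- the paper's actual instructions are
    more detailed and fine-grained.
    """
    return (
        "You are grading an answer about a surveillance video.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Rate the model answer from 0 (wrong) to 5 (fully correct). "
        "Reply in the form 'Score: <n>'."
    )

def parse_score(judge_reply: str) -> int:
    """Extract the integer score from the judge's free-text reply."""
    match = re.search(r"Score:\s*(\d)", judge_reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {judge_reply!r}")
    return int(match.group(1))

# With the official openai client, the judge call would look roughly like:
#   from openai import OpenAI
#   client = OpenAI()
#   reply = client.chat.completions.create(
#       model="gpt-4o",
#       messages=[{"role": "user",
#                  "content": build_judge_prompt(q, ref, ans)}],
#   ).choices[0].message.content
#   score = parse_score(reply)
```

Averaging such scores per question type yields a per-category accuracy, which is the kind of fine-grained signal the benchmark reports across its six QA categories.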
Problem

Research questions and friction points this paper is trying to address.

Develop benchmark for crime surveillance analysis
Evaluate MLLMs in video anomaly detection
Enhance model understanding of anomalous concepts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal large language models
OpenAI GPT-4o assessment
Finetuned LLaVA-OneVision model
👥 Authors
Haoran Chen
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Wuhan AI Research
Dongyi Yi
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Wuhan AI Research
Moyan Cao
University of Chinese Academy of Sciences
Chensen Huang
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Wuhan AI Research
Guibo Zhu
Institute of Automation, Chinese Academy of Sciences
Jinqiao Wang
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Wuhan AI Research