🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit limited capability in understanding anomalies in surveillance videos, and no dedicated benchmark supports open-ended question-answering (QA) evaluation for this task. To address this gap, we introduce UCVL, the first MLLM-oriented benchmark for criminal surveillance video analysis, comprising 1,829 real-world videos with meticulously re-annotated, multi-granularity anomaly labels and six categories of open-ended QA pairs. We also propose an automatic, fine-grained framework that uses GPT-4o to evaluate free-form textual responses. Using UCVL, we systematically evaluate eight state-of-the-art MLLMs spanning 0.5B to 40B parameters. Furthermore, fine-tuning LLaVA-OneVision on UCVL yields significant performance gains (+12.3% average accuracy), demonstrating the benchmark's high quality, practical utility, and generalizability. UCVL fills a critical void in evaluating large models for video anomaly understanding.
📝 Abstract
Anomaly analysis in surveillance videos is a crucial topic in computer vision. In recent years, multimodal large language models (MLLMs) have outperformed task-specific models in various domains. Although MLLMs are particularly versatile, their ability to understand anomalous concepts and details remains insufficiently studied, because existing benchmarks in this field are outdated: they provide neither MLLM-style QA pairs nor efficient methods for assessing a model's open-ended text responses. To fill this gap, we propose UCVL, a benchmark for crime surveillance video analysis with large models, comprising 1,829 videos and reorganized annotations from the UCF-Crime and UCF-Crime Annotation datasets. We design six types of questions and generate diverse QA pairs. We then develop detailed grading instructions and use OpenAI's GPT-4o for accurate assessment. We benchmark eight prevailing MLLMs ranging from 0.5B to 40B parameters, and the results demonstrate the reliability of this benchmark. Moreover, we finetune LLaVA-OneVision on UCVL's training set; the resulting improvement validates our data's high quality for video anomaly analysis.
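The GPT-4o-based assessment described above can be sketched as a judge prompt plus a score parser. The prompt wording, the 0-10 scale, and the helper names below are illustrative assumptions, not the paper's actual grading instructions; the commented-out call assumes the OpenAI Python SDK.

```python
import re


def build_grading_prompt(question: str, reference: str, answer: str) -> str:
    """Compose a grading instruction for the judge model.

    Hypothetical wording; the paper's actual detailed instructions
    are not reproduced here.
    """
    return (
        "You are grading an open-ended answer about a surveillance video.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Rate factual agreement with the reference on a 0-10 scale and "
        "reply in the form 'Score: <n>'."
    )


def parse_score(reply: str):
    """Extract the integer score from the judge's reply, or None if absent."""
    m = re.search(r"Score:\s*(\d+)", reply)
    return int(m.group(1)) if m else None


# Judge call (assumed OpenAI Python SDK; requires OPENAI_API_KEY):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[{"role": "user",
#                "content": build_grading_prompt(q, ref, ans)}],
# )
# score = parse_score(resp.choices[0].message.content)
```

A rubric-constrained reply format like `Score: <n>` keeps the judge's output machine-parseable, so thousands of open-ended answers can be scored without manual review.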