🤖 AI Summary
Existing video retrieval benchmarks primarily focus on scene-level similarity, which is insufficient for evaluating fine-grained discriminative capabilities in surveillance scenarios involving vehicle actions. To address this limitation, this work proposes SOVABench, the first vehicle behavior retrieval benchmark tailored to real-world surveillance settings, and introduces two novel evaluation protocols to assess models' understanding of action oppositionality and temporal directionality. Leveraging the visual reasoning and instruction-following capabilities of multimodal large language models (MLLMs), the proposed method generates interpretable textual embeddings for zero-shot image-to-video retrieval without requiring task-specific training. Experimental results demonstrate that this approach significantly outperforms conventional contrastive vision-language models on SOVABench as well as multiple spatial and counting benchmarks, confirming its effectiveness and strong generalization ability.
📄 Abstract
Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.
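The training-free retrieval idea described above can be sketched in a few lines: describe both the query image and each gallery video in text, then rank by similarity of the resulting embeddings. In this toy sketch, `describe_with_mllm` is a hypothetical stand-in for an actual MLLM captioning call, and the bag-of-words vectors are a toy substitute for a real text encoder; it illustrates the pipeline shape, not the paper's implementation:

```python
from collections import Counter
import math

def describe_with_mllm(media):
    # Hypothetical stand-in: a real system would prompt an MLLM
    # (e.g. "Describe the vehicle action in this clip") on the raw
    # image or video frames and return the generated description.
    return media["description"]

def embed(text):
    # Toy bag-of-words embedding; a real system would use a text encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_image, video_gallery):
    # Zero-shot image-to-video retrieval: both modalities are mapped
    # to interpretable text first, so ranking happens in text space.
    q = embed(describe_with_mllm(query_image))
    scored = [(cosine(q, embed(describe_with_mllm(v))), v["id"])
              for v in video_gallery]
    return [vid for _, vid in sorted(scored, reverse=True)]

# Illustrative data only (not from the benchmark).
query = {"description": "a white car turns left at the intersection"}
gallery = [
    {"id": "v1", "description": "a truck drives straight through the junction"},
    {"id": "v2", "description": "a white car turns left at an intersection"},
    {"id": "v3", "description": "a car turns right at the intersection"},
]
ranking = retrieve(query, gallery)  # v2 ranks first
```

Because retrieval is mediated by generated text, the embeddings stay human-readable, which is what makes the ranking interpretable relative to contrastive vision-language models.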