🤖 AI Summary
Existing video retrieval benchmarks primarily focus on scene-level similarity, which is insufficient for evaluating fine-grained discriminative capabilities in surveillance scenarios involving vehicle actions. To address this limitation, this work proposes SOVABench, the first vehicle behavior retrieval benchmark tailored to real-world surveillance settings, and introduces two novel evaluation protocols to assess models' understanding of action oppositionality and temporal directionality. Leveraging the visual reasoning and instruction-following capabilities of multimodal large language models (MLLMs), the proposed method generates interpretable textual embeddings for zero-shot image-to-video retrieval without requiring task-specific training. Experimental results demonstrate that this approach significantly outperforms conventional contrastive vision-language models on SOVABench as well as multiple spatial and counting benchmarks, confirming its effectiveness and strong generalization ability.
📄 Abstract
Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.
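The training-free retrieval idea described above can be sketched in a few lines: describe both the query image and each gallery video in text, then rank by similarity of the resulting embeddings. In this toy sketch, `describe_with_mllm` is a hypothetical stand-in for an actual MLLM captioning call, and the bag-of-words vectors are a toy substitute for a real text encoder; it illustrates the pipeline shape, not the paper's implementation:

```python
from collections import Counter
import math

def describe_with_mllm(media):
    # Hypothetical stand-in: a real system would prompt an MLLM
    # (e.g. "Describe the vehicle action in this clip") on the raw
    # image or video frames and return the generated description.
    return media["description"]

def embed(text):
    # Toy bag-of-words embedding; a real system would use a text encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_image, video_gallery):
    # Zero-shot image-to-video retrieval: both modalities are mapped
    # to interpretable text first, so ranking happens in text space.
    q = embed(describe_with_mllm(query_image))
    scored = [(cosine(q, embed(describe_with_mllm(v))), v["id"])
              for v in video_gallery]
    return [vid for _, vid in sorted(scored, reverse=True)]

# Illustrative data only (not from the benchmark).
query = {"description": "a white car turns left at the intersection"}
gallery = [
    {"id": "v1", "description": "a truck drives straight through the junction"},
    {"id": "v2", "description": "a white car turns left at an intersection"},
    {"id": "v3", "description": "a car turns right at the intersection"},
]
ranking = retrieve(query, gallery)  # v2 ranks first
```

Because retrieval is mediated by generated text, the embeddings stay human-readable, which is what makes the ranking interpretable relative to contrastive vision-language models.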