CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) perform well on single-video tasks but show significant limitations in cross-video relational reasoning, such as inter-video object/event association and complex causal inference, which hinders their deployment in real-world multi-camera surveillance systems. Method: We introduce CVBench, the first comprehensive benchmark explicitly designed for cross-video relational reasoning, comprising three hierarchical task levels: object-level association, event-level association, and complex reasoning. It evaluates more than ten state-of-the-art MLLMs, including GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL, under zero-shot and chain-of-thought settings. Contribution/Results: Experiments reveal a substantial performance gap: even the best-performing model achieves only 60% accuracy on causal reasoning tasks, markedly below human performance (91%), exposing fundamental bottlenecks in maintaining cross-video contextual coherence and disambiguating entities. CVBench provides a reproducible, diagnostics-oriented evaluation framework to advance both multi-video understanding assessment and the design of next-generation model architectures.

📝 Abstract
While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their ability to reason across multiple videos remains critically underexplored. Yet this capability is essential for real-world applications, including multi-camera surveillance and cross-video procedural learning. To bridge this gap, we present CVBench, the first comprehensive benchmark designed to rigorously assess cross-video relational reasoning. CVBench comprises 1,000 question-answer pairs spanning three hierarchical tiers: cross-video object association (identifying shared entities), cross-video event association (linking temporal or causal event chains), and cross-video complex reasoning (integrating commonsense and domain knowledge). Built from five domain-diverse video clusters (e.g., sports, life records), the benchmark challenges models to synthesise information across dynamic visual contexts. We extensively evaluate 10+ leading MLLMs (including GPT-4o, Gemini-2.0-flash, Qwen2.5-VL) under zero-shot and chain-of-thought prompting paradigms. Key findings reveal stark performance gaps: even top models such as GPT-4o achieve only 60% accuracy on causal reasoning tasks, compared with 91% for humans. Crucially, our analysis reveals fundamental bottlenecks inherent in current MLLM architectures, notably deficient inter-video context retention and poor disambiguation of overlapping entities. CVBench establishes a rigorous framework for diagnosing and advancing multi-video reasoning, offering architectural insights for next-generation MLLMs. The data and evaluation code are available at https://github.com/Hokhim2/CVBench.
Problem

Research questions and friction points this paper is trying to address.

Assessing multimodal models' cross-video relational reasoning capabilities
Evaluating model performance on multi-video object and event association
Identifying architectural bottlenecks in inter-video context retention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CVBench benchmark for cross-video evaluation
Tests object association, event association, and complex reasoning tiers
Reveals MLLM architecture limitations in inter-video context
Authors

Nannan Zhu (Sun Yat-sen University)
Yonghao Dong (Sun Yat-sen University)
Teng Wang (University of Hong Kong)
Xueqian Li (Carnegie Mellon University)
Shengjun Deng (Foshan University)
Yijia Wang (Institute of Theoretical Physics, Chinese Academy of Sciences)
Zheng Hong (Sun Yat-sen University)
Tiantian Geng (University of Birmingham)
Guo Niu (Foshan University)
Hanyan Huang (Sun Yat-sen University)
Xiongfei Yao (Foshan University)
Shuaiwei Jiao (Foshan University)