🤖 AI Summary
Existing MLLM evaluation benchmarks are confined to single-video understanding, failing to support real-world applications, such as sports analytics and autonomous driving, that require reasoning across multiple videos. To address this gap, we propose MVU-Eval, the first comprehensive benchmark for multi-video understanding: it comprises 1,824 multi-video question-answer pairs spanning a diverse set of 4,959 videos and covering eight core capabilities, including complex tasks such as cross-view alignment and multi-sensor fusion. Systematic evaluation of leading open- and closed-source MLLMs reveals significant deficiencies in high-level reasoning, particularly cross-video temporal alignment, causal inference, and consistency judgment. By quantitatively exposing these bottlenecks, MVU-Eval establishes a reproducible, fine-grained evaluation framework to guide future innovations in model architecture and training paradigms.
📝 Abstract
The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g., sports analytics and autonomous driving). To address this significant gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. Specifically, MVU-Eval assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, covering both fundamental perception tasks and high-order reasoning tasks. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. Through extensive evaluation of state-of-the-art open-source and closed-source models, we reveal significant performance discrepancies and limitations in current MLLMs' ability to understand and reason across multiple videos. The benchmark will be made publicly available to foster future research.
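To make the described evaluation setup concrete, here is a minimal sketch of how a multi-video, multiple-choice QA benchmark like this might be scored. The record layout and field names (`videos`, `question`, `options`, `answer`) are hypothetical assumptions for illustration only, not the released MVU-Eval schema.

```python
import json

def evaluate(model_predict, qa_path):
    """Score a model on multiple-choice, multi-video QA items.

    NOTE: the record layout below is a hypothetical sketch, not the
    official MVU-Eval format. Each item is assumed to reference two or
    more video files and one correct option letter.
    """
    with open(qa_path) as f:
        items = json.load(f)  # e.g., a list of 1,824 QA dicts

    correct = 0
    for item in items:
        # item["videos"]: paths to the multiple videos the question spans
        # item["question"], item["options"], item["answer"]: MCQ fields
        pred = model_predict(item["videos"], item["question"], item["options"])
        correct += int(pred == item["answer"])
    return correct / len(items)
```

Under this kind of setup, per-capability accuracy (e.g., cross-view alignment vs. multi-sensor fusion) can be obtained by grouping items on a task-type field before averaging.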