🤖 AI Summary
Existing MLLM evaluation benchmarks are confined to single-video understanding, failing to support real-world applications, such as sports analytics and autonomous driving, that require reasoning across multiple videos. To address this gap, we propose MVU-Eval, the first comprehensive benchmark for multi-video understanding: it comprises 1,824 multi-video question-answer pairs spanning a diverse set of 4,959 videos and covering eight core capabilities, including complex tasks such as cross-view alignment and multi-sensor fusion. Systematic evaluation of leading open- and closed-source MLLMs reveals significant deficiencies in high-level reasoning, particularly cross-video temporal alignment, causal inference, and consistency judgment. By quantitatively exposing these bottlenecks, MVU-Eval establishes a reproducible, fine-grained evaluation framework to guide future innovations in model architecture and training paradigms.
📝 Abstract
The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g., sports analytics and autonomous driving). To address this significant gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. Specifically, MVU-Eval assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, covering both fundamental perception tasks and high-order reasoning tasks. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. Through extensive evaluation of state-of-the-art open-source and closed-source models, we reveal significant performance discrepancies and limitations in current MLLMs' ability to understand and reason across multiple videos. The benchmark will be made publicly available to foster future research.
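To make the described evaluation setup concrete, here is a minimal sketch of how a multi-video, multiple-choice QA benchmark like this might be scored. The record layout and field names (`videos`, `question`, `options`, `answer`) are hypothetical assumptions for illustration only, not the released MVU-Eval schema.

```python
import json

def evaluate(model_predict, qa_path):
    """Score a model on multiple-choice, multi-video QA items.

    NOTE: the record layout below is a hypothetical sketch, not the
    official MVU-Eval format. Each item is assumed to reference two or
    more video files and one correct option letter.
    """
    with open(qa_path) as f:
        items = json.load(f)  # e.g., a list of 1,824 QA dicts

    correct = 0
    for item in items:
        # item["videos"]: paths to the multiple videos the question spans
        # item["question"], item["options"], item["answer"]: MCQ fields
        pred = model_predict(item["videos"], item["question"], item["options"])
        correct += int(pred == item["answer"])
    return correct / len(items)
```

Under this kind of setup, per-capability accuracy (e.g., cross-view alignment vs. multi-sensor fusion) can be obtained by grouping items on a task-type field before averaging.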