MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MLLM evaluation benchmarks are limited to single-turn question answering, failing to capture the complex interactive requirements of realistic multi-turn video dialogues. To address this gap, we introduce MT-Video-Bench, the first benchmark specifically designed for evaluating multi-turn video dialogue understanding. It spans six practical scenarios, including interactive sports analysis and video-based instruction, and systematically defines six fine-grained capability dimensions covering perception and interaction. Built upon 987 manually curated, high-quality multi-turn dialogues, the evaluation integrates both video content understanding and dialogue coherence. Experimental results reveal, for the first time, substantial performance disparities, as well as shared bottlenecks, between leading open- and closed-source MLLMs on multi-turn video dialogue tasks. MT-Video-Bench is publicly released, providing a standardized, reproducible testbed to advance research on natural human–machine video interaction.

📝 Abstract
The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.
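The summary and abstract describe an evaluation that presents each dialogue to the model turn by turn, carries the earlier turns as context, and scores answers for both video grounding and dialogue coherence. As a rough, hypothetical illustration only (not the paper's actual harness), the Python sketch below shows one way such a multi-turn evaluation loop could be structured; the `Dialogue`/`Turn` classes, the `model.generate` interface, and the `score_answer` function are assumed placeholders.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    question: str
    reference: str  # expected answer for this turn

@dataclass
class Dialogue:
    video_path: str
    turns: list  # list of Turn objects

def evaluate(model, dialogues, score_answer):
    """Run each dialogue turn by turn, carrying the conversation history,
    and return the mean per-turn score across all dialogues."""
    scores = []
    for dlg in dialogues:
        history = []  # (question, answer) pairs asked so far in this dialogue
        for turn in dlg.turns:
            # The model sees the video, the running dialogue history,
            # and the current question (hypothetical generate() signature).
            answer = model.generate(video=dlg.video_path,
                                    history=history,
                                    question=turn.question)
            scores.append(score_answer(answer, turn.reference))
            history.append((turn.question, answer))
    return sum(scores) / len(scores) if scores else 0.0
```

Keeping the full per-dialogue history in the model's context is what distinguishes this setup from single-turn benchmarks: a later answer can be wrong not because the video was misread but because the conversation state was lost.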
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal LLMs in multi-turn video dialogues
Assessing core competencies for real-world video applications
Revealing performance gaps in interactive video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a dedicated multi-turn video dialogue benchmark
Assesses six core video understanding competencies
Evaluates multimodal LLMs on real-world applications
Authors
Yaning Pan, Fudan University
Zekun Wang, Kuaishou Technology
Qianqian Xie, Wuhan University
Yongqian Wen, Nanjing University
Yuanxing Zhang, Kuaishou Technology
Guohui Zhang, University of Hawaii
Haoxuan Hu, Nanjing University
Zhiyu Pan, Department of Automation, Tsinghua University
Yibing Huang, Nanjing University
Zhidong Gan, Nanjing University
Yonghong Lin, Nanjing University
An Ping, Nanjing University
Tianhao Peng, Nanjing University
Jiaheng Liu, Nanjing University