MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MLLM evaluation benchmarks are limited to single-turn question answering, failing to capture the complex interactive requirements of realistic multi-turn video dialogues. To address this gap, we introduce MT-Video-Bench, the first benchmark specifically designed for evaluating multi-turn video dialogue understanding. It spans six practical scenarios, including interactive sports analysis and video-based instruction, and systematically defines six fine-grained capability dimensions covering perception and interaction. Built upon 987 manually curated, high-quality multi-turn dialogues, the evaluation integrates both video content understanding and dialogue coherence. Experimental results reveal, for the first time, substantial performance disparities, as well as shared bottlenecks, between leading open- and closed-source MLLMs on multi-turn video dialogue tasks. MT-Video-Bench is publicly released, providing a standardized, reproducible testbed to advance research on natural human–machine video interaction.

📝 Abstract
The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.
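The summary and abstract describe an evaluation that presents each dialogue to the model turn by turn, carries the earlier turns as context, and scores answers for both video grounding and dialogue coherence. As a rough, hypothetical illustration only (not the paper's actual harness), the Python sketch below shows one way such a multi-turn evaluation loop could be structured; the `Dialogue`/`Turn` classes, the `model.generate` interface, and the `score_answer` function are assumed placeholders.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    question: str
    reference: str  # expected answer for this turn

@dataclass
class Dialogue:
    video_path: str
    turns: list  # list of Turn objects

def evaluate(model, dialogues, score_answer):
    """Run each dialogue turn by turn, carrying the conversation history,
    and return the mean per-turn score across all dialogues."""
    scores = []
    for dlg in dialogues:
        history = []  # (question, answer) pairs asked so far in this dialogue
        for turn in dlg.turns:
            # The model sees the video, the running dialogue history,
            # and the current question (hypothetical generate() signature).
            answer = model.generate(video=dlg.video_path,
                                    history=history,
                                    question=turn.question)
            scores.append(score_answer(answer, turn.reference))
            history.append((turn.question, answer))
    return sum(scores) / len(scores) if scores else 0.0
```

Keeping the full per-dialogue history in the model's context is what distinguishes this setup from single-turn benchmarks: a later answer can be wrong not because the video was misread but because the conversation state was lost.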
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal LLMs in multi-turn video dialogues
Assessing core competencies for real-world video applications
Revealing performance gaps in interactive video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a dedicated multi-turn video dialogue benchmark
Assesses six core video understanding competencies
Evaluates multimodal LLMs on real-world applications
Authors
Yaning Pan, Fudan University
Zekun Wang, Kuaishou Technology
Qianqian Xie, Wuhan University
Yongqian Wen, Nanjing University
Yuanxing Zhang, Kuaishou Technology
Guohui Zhang, University of Hawaii
Haoxuan Hu, Nanjing University
Zhiyu Pan, Department of Automation, Tsinghua University
Yibing Huang, Nanjing University
Zhidong Gan, Nanjing University
Yonghong Lin, Nanjing University
An Ping, Nanjing University
Tianhao Peng, Nanjing University
Jiaheng Liu, Nanjing University