Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video captioning benchmarks suffer from inadequate and homogeneous key point coverage, high construction costs, and limited evaluation scopes. To address these issues, the authors propose AutoCaption, a framework that uses Monte Carlo Tree Search (MCTS) to iteratively generate numerous, diverse descriptive sentences (key points) covering actions, object attributes, environment details, and more. With AutoCaption they construct MCTS-VCB, a fine-grained, multi-dimensional video caption benchmark, and evaluate more than 20 open- and closed-source multimodal large language models (MLLMs) on it, where Gemini-1.5-Pro achieves the highest F1 score of 71.2. Fine-tuning InternVL2.5-8B on AutoCaption-generated data yields an overall improvement of 25.0% on MCTS-VCB and 16.3% on DREAM-1K, further demonstrating the framework's effectiveness for video understanding evaluation.

📝 Abstract
Video captioning can be used to assess the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, existing benchmarks and evaluation protocols suffer from crucial issues, such as inadequate or homogeneous creation of key points, exorbitant cost of data creation, and limited evaluation scopes. To address these issues, we propose an automatic framework, named AutoCaption, which leverages Monte Carlo Tree Search (MCTS) to construct numerous and diverse descriptive sentences (i.e., key points) that thoroughly represent video content in an iterative way. This iterative captioning strategy enables the continuous enhancement of video details such as actions, objects' attributes, environment details, etc. We apply AutoCaption to curate MCTS-VCB, a fine-grained video caption benchmark covering video details, thereby enabling a comprehensive evaluation of MLLMs on the video captioning task. We evaluate more than 20 open- and closed-source MLLMs of varying sizes on MCTS-VCB. Results show that MCTS-VCB can effectively and comprehensively evaluate the video captioning capability, with Gemini-1.5-Pro achieving the highest F1 score of 71.2. Interestingly, we fine-tune InternVL2.5-8B with the AutoCaption-generated data, which helps the model achieve an overall improvement of 25.0% on MCTS-VCB and 16.3% on DREAM-1K, further demonstrating the effectiveness of AutoCaption. The code and data are available at https://github.com/tjunlp-lab/MCTS-VCB.
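The abstract describes AutoCaption's core loop only at a high level. The toy Python below illustrates the general shape of MCTS-driven iterative key point generation; it is a minimal sketch under assumptions, since the paper's actual prompts, reward design, and MLLM interface are not given here. `query_mllm`, the aspect pool, and the novelty reward are all hypothetical stand-ins.

```python
# Minimal, illustrative sketch of MCTS-style iterative key point generation.
# query_mllm is a hypothetical stub standing in for a real multimodal model call.
import math
import random

def query_mllm(video, covered_aspects):
    """Hypothetical MLLM call: return one new descriptive sentence (key point)
    about `video`, steered away from aspects already covered."""
    pool = {
        "action": "A man pours coffee into a mug.",
        "attribute": "The mug is bright red with a chipped handle.",
        "environment": "The scene takes place in a sunlit kitchen.",
        "camera": "The camera slowly zooms in on the table.",
    }
    remaining = [a for a in pool if a not in covered_aspects]
    aspect = random.choice(remaining or list(pool))
    return aspect, pool[aspect]

class Node:
    def __init__(self, covered, parent=None):
        self.covered = covered          # aspects described so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def ucb(node, c=1.4):
    """Standard UCB1 score used to balance exploring new aspects vs. revisiting."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts_keypoints(video, iterations=20):
    root = Node(covered=frozenset())
    keypoints = set()
    for _ in range(iterations):
        # Selection: descend by UCB to a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: ask the (stubbed) MLLM for one more key point.
        aspect, sentence = query_mllm(video, node.covered)
        child = Node(node.covered | {aspect}, parent=node)
        node.children.append(child)
        keypoints.add(sentence)
        # Reward novel aspects -- a stand-in for the paper's
        # diversity-oriented scoring, which is not specified here.
        reward = 1.0 if aspect not in node.covered else 0.1
        # Backpropagation: update statistics along the path to the root.
        while child:
            child.visits += 1
            child.value += reward
            child = child.parent
    return keypoints

if __name__ == "__main__":
    for kp in sorted(mcts_keypoints("demo.mp4")):
        print("-", kp)
```

In this sketch the search state is the set of aspects already described, so UCB naturally pushes expansion toward videos' uncovered details, which mirrors the iterative enrichment the abstract describes.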
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' video understanding via captioning benchmarks
Addressing inadequate diversity in video key point creation
Reducing high costs in video caption data generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monte Carlo Tree Search for diverse captions
AutoCaption framework enhances video details iteratively
MCTS-VCB benchmark for comprehensive MLLM evaluation (a scoring sketch follows below)
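Since MCTS-VCB reports model quality as an F1 score over covered key points (Gemini-1.5-Pro's 71.2 above), a rough sketch of that style of scoring may help. This is not the benchmark's actual judge: matching a caption to a key point would plausibly use model-based judging, while the word-overlap heuristic below is a crude hypothetical stand-in.

```python
# Hedged sketch of key-point-based F1 scoring in the spirit of MCTS-VCB.
# The `covers` heuristic is a toy proxy judge, not the paper's method.
import string

def _words(text):
    """Lowercase and strip punctuation before splitting into a word set."""
    clean = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(clean.split())

def covers(text, keypoint, threshold=0.6):
    """Toy judge: does `text` mention most of `keypoint`'s words?"""
    kp = _words(keypoint)
    return len(kp & _words(text)) / max(len(kp), 1) >= threshold

def keypoint_f1(caption_sentences, keypoints):
    # Recall: fraction of reference key points covered by the full caption.
    full_caption = " ".join(caption_sentences)
    recall = sum(covers(full_caption, kp) for kp in keypoints) / max(len(keypoints), 1)
    # Precision: fraction of caption sentences supported by some key point.
    supported = sum(any(covers(s, kp) for kp in keypoints) for s in caption_sentences)
    precision = supported / max(len(caption_sentences), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    refs = ["A man pours coffee into a red mug.",
            "The kitchen is brightly lit by sunlight."]
    cand = ["A man pours coffee into a red mug",
            "in a sunlit kitchen."]
    print(f"F1 = {keypoint_f1(cand, refs):.3f}")  # 0.500 with this toy judge
```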
👥 Authors
Linhao Yu
TJUNLP Lab, College of Intelligence and Computing, Tianjin University, Tianjin, China
Xinguang Ji
Independent Researcher
Yahui Liu
Independent Researcher
Fanheng Kong
Northeastern University; Kuaishou Technology
Multimodal LLM · Multimodal Understanding
Chenxi Sun
Independent Researcher
Jingyuan Zhang
Independent Researcher
Hongzhi Zhang
Professor of Computer Science and Technology, Harbin Institute of Technology
Deep Learning · Artificial Intelligence · Computer Vision
W. V.
Independent Researcher
Fuzheng Zhang
Independent Researcher
Deyi Xiong
Professor, College of Intelligence and Computing, Tianjin University, China
Natural Language Processing · Large Language Models · AI4Science