ExpVid: A Benchmark for Experiment Video Understanding & Reasoning

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
How well existing multimodal large language models (MLLMs) understand experimental procedures in real-world wet-lab settings remains unclear, because mainstream benchmarks do not cover fine-grained manipulations or long-horizon procedural modeling. Method: We introduce ExpVid, the first multimodal evaluation benchmark for scientific experiment videos, built around a three-level task hierarchy: fine-grained perception, procedural understanding, and scientific reasoning. ExpVid adopts a vision-centric annotation paradigm that enables systematic assessment of the logical links among tools, steps, and conclusions. Annotations are derived from peer-reviewed experiment videos and produced by an automated pipeline validated by multidisciplinary domain experts, with a strong emphasis on visual grounding. Contribution/Results: Evaluating 19 state-of-the-art MLLMs, we uncover critical deficiencies in state tracking and scientific reasoning, and we provide the first quantitative characterization of the performance gap between proprietary and open-source models, revealing substantial disparities in procedural comprehension and causal inference.

📝 Abstract
Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' ability to understand fine-grained scientific experiment videos
Assessing multimodal reasoning on procedural steps and state changes
Bridging the gap between experimental procedures and scientific conclusions
Innovation

Methods, ideas, or system contributions that make the work stand out.

First benchmark for scientific experiment video evaluation
Three-level task hierarchy mirroring scientific process
Vision-centric annotation with expert validation
👥 Authors
Yicheng Xu, Tokyo Institute of Technology (Computer Vision · Continual Learning · Kernel Method)
Yue Wu, Shanghai AI Laboratory
Jiashuo Yu, Shanghai AI Laboratory (Audio-Visual Learning · Computer Vision · Multimodal Learning)
Ziang Yan, Shanghai AI Laboratory
Tianxiang Jiang, Shanghai AI Laboratory
Yinan He, Shanghai AI Laboratory
Qingsong Zhao, Tongji University (Machine Learning · Computer Vision)
Kai Chen, Shanghai AI Laboratory
Yu Qiao, Shanghai AI Laboratory
Limin Wang, Shanghai AI Laboratory, Nanjing University
Manabu Okumura, Institute of Science Tokyo
Yi Wang, Shanghai AI Laboratory