🤖 AI Summary
How well existing multimodal large language models (MLLMs) understand experimental procedures in real-world wet-lab settings remains unclear, because mainstream benchmarks do not cover fine-grained manipulations or long-horizon procedural modeling. Method: We introduce ExpVid, the first multimodal evaluation benchmark for scientific experiment videos, built around a three-tiered task hierarchy: fine-grained perception, procedural understanding, and scientific reasoning. ExpVid adopts a vision-centric annotation paradigm that enables systematic assessment of the logical links among tools, steps, and conclusions. Annotations are derived from peer-reviewed experimental videos and generated by an automated pipeline validated by multidisciplinary domain experts, with a strong emphasis on visual grounding. Contribution/Results: Evaluating 19 state-of-the-art MLLMs, we uncover critical deficiencies in state tracking and scientific reasoning, and we quantify the gap between closed-source and open-source models, revealing substantial disparities in procedural comprehension and causal inference.
📝 Abstract
Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.
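To make the three-level task hierarchy concrete, the sketch below shows one way ExpVid-style items and per-level accuracy could be represented in Python. This is an illustrative assumption only: the class names, fields, file names, and scoring logic are hypothetical and do not reflect the benchmark's actual data format or evaluation protocol.

```python
from dataclasses import dataclass
from enum import Enum


class TaskLevel(Enum):
    """The three-level hierarchy described in the abstract."""
    FINE_GRAINED_PERCEPTION = 1   # tools, materials, and actions
    PROCEDURAL_UNDERSTANDING = 2  # step order and completeness
    SCIENTIFIC_REASONING = 3      # linking the experiment to its conclusions


@dataclass
class ExpVidSample:
    """One hypothetical benchmark item: a video clip plus a multiple-choice question."""
    video_path: str
    level: TaskLevel
    question: str
    options: list[str]
    answer_index: int


def accuracy_by_level(samples: list[ExpVidSample],
                      predictions: list[int]) -> dict[TaskLevel, float]:
    """Aggregate simple accuracy per task level, as one might report benchmark results."""
    correct = {level: 0 for level in TaskLevel}
    total = {level: 0 for level in TaskLevel}
    for sample, pred in zip(samples, predictions):
        total[sample.level] += 1
        correct[sample.level] += int(pred == sample.answer_index)
    return {level: correct[level] / total[level]
            for level in TaskLevel if total[level] > 0}


if __name__ == "__main__":
    # Toy items with made-up content; real ExpVid questions are curated from
    # peer-reviewed experiment videos and validated by domain experts.
    samples = [
        ExpVidSample("clip_001.mp4", TaskLevel.FINE_GRAINED_PERCEPTION,
                     "Which tool is being used?", ["pipette", "forceps", "spatula"], 0),
        ExpVidSample("clip_002.mp4", TaskLevel.PROCEDURAL_UNDERSTANDING,
                     "Which step comes next?", ["centrifuge", "vortex", "incubate"], 2),
    ]
    predictions = [0, 1]  # model outputs, as indices into each option list
    print(accuracy_by_level(samples, predictions))
```

Reporting accuracy separately per level, as in this sketch, is what allows the kind of diagnosis the abstract describes: strong coarse-grained perception scores alongside weaker procedural-understanding and scientific-reasoning scores.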