🤖 AI Summary
How well existing multimodal large language models (MLLMs) understand experimental procedures in real-world wet-lab settings remains unclear, because mainstream benchmarks do not cover fine-grained manipulations or long-horizon procedural modeling. Method: We introduce ExpVid, the first multimodal evaluation benchmark for scientific experiment videos, built around a three-tiered task hierarchy: fine-grained perception, procedural understanding, and scientific reasoning. ExpVid adopts a vision-centric annotation paradigm that enables systematic assessment of the logical links among tools, steps, and conclusions. Annotations are derived from peer-reviewed experimental videos and generated by an automated pipeline validated by multidisciplinary domain experts, with a strong emphasis on visual grounding. Contribution/Results: Evaluating 19 state-of-the-art MLLMs, we uncover critical deficiencies in state tracking and scientific reasoning, and we quantify the gap between closed-source and open-source models, revealing substantial disparities in procedural comprehension and causal inference.
📝 Abstract
Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.
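To make the three-level task hierarchy concrete, the sketch below shows one way ExpVid-style items and per-level accuracy could be represented in Python. This is an illustrative assumption only: the class names, fields, file names, and scoring logic are hypothetical and do not reflect the benchmark's actual data format or evaluation protocol.

```python
from dataclasses import dataclass
from enum import Enum


class TaskLevel(Enum):
    """The three-level hierarchy described in the abstract."""
    FINE_GRAINED_PERCEPTION = 1   # tools, materials, and actions
    PROCEDURAL_UNDERSTANDING = 2  # step order and completeness
    SCIENTIFIC_REASONING = 3      # linking the experiment to its conclusions


@dataclass
class ExpVidSample:
    """One hypothetical benchmark item: a video clip plus a multiple-choice question."""
    video_path: str
    level: TaskLevel
    question: str
    options: list[str]
    answer_index: int


def accuracy_by_level(samples: list[ExpVidSample],
                      predictions: list[int]) -> dict[TaskLevel, float]:
    """Aggregate simple accuracy per task level, as one might report benchmark results."""
    correct = {level: 0 for level in TaskLevel}
    total = {level: 0 for level in TaskLevel}
    for sample, pred in zip(samples, predictions):
        total[sample.level] += 1
        correct[sample.level] += int(pred == sample.answer_index)
    return {level: correct[level] / total[level]
            for level in TaskLevel if total[level] > 0}


if __name__ == "__main__":
    # Toy items with made-up content; real ExpVid questions are curated from
    # peer-reviewed experiment videos and validated by domain experts.
    samples = [
        ExpVidSample("clip_001.mp4", TaskLevel.FINE_GRAINED_PERCEPTION,
                     "Which tool is being used?", ["pipette", "forceps", "spatula"], 0),
        ExpVidSample("clip_002.mp4", TaskLevel.PROCEDURAL_UNDERSTANDING,
                     "Which step comes next?", ["centrifuge", "vortex", "incubate"], 2),
    ]
    predictions = [0, 1]  # model outputs, as indices into each option list
    print(accuracy_by_level(samples, predictions))
```

Reporting accuracy separately per level, as in this sketch, is what allows the kind of diagnosis the abstract describes: strong coarse-grained perception scores alongside weaker procedural-understanding and scientific-reasoning scores.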