🤖 AI Summary
Existing benchmarks struggle to systematically evaluate the practical capabilities of multimodal large language models (MLLMs) across the full workflow of real-world polymer science experiments. To address this gap, this work introduces PolyReal, the first multimodal evaluation benchmark to span the entire research lifecycle, covering five core competencies: application of foundational theory, laboratory safety analysis, experimental mechanism reasoning, raw data extraction, and exploration of performance and applications. The benchmark integrates multimodal inputs, including text and images, and features structured tasks and evaluation metrics closely aligned with authentic scientific practice. Evaluations of leading MLLMs reveal strong performance on knowledge-intensive tasks but significant deficiencies in practice-based tasks such as laboratory safety analysis and raw data extraction, highlighting a critical disconnect between abstract knowledge and real-world applicability.
📝 Abstract
Multimodal Large Language Models (MLLMs) excel in general domains but struggle with complex, real-world science. We posit that polymer science, an interdisciplinary field spanning chemistry, physics, biology, and engineering, is an ideal high-stakes testbed due to its diverse multimodal data. Yet existing polymer-science benchmarks largely overlook real-world workflows, limiting their practical utility and failing to systematically evaluate MLLMs across the full, practice-grounded lifecycle of experimentation. We introduce PolyReal, a novel multimodal benchmark grounded in real-world scientific practice that evaluates MLLMs on the full lifecycle of polymer experimentation. It covers five critical capabilities: (1) foundational knowledge application; (2) lab safety analysis; (3) experiment mechanism reasoning; (4) raw data extraction; and (5) performance and application exploration. Our evaluation of leading MLLMs on PolyReal reveals a capability imbalance: models perform well on knowledge-intensive reasoning (e.g., Experiment Mechanism Reasoning) but drop sharply on practice-based tasks (e.g., Lab Safety Analysis and Raw Data Extraction). This exposes a severe gap between abstract scientific knowledge and its practical, context-dependent application. PolyReal thus addresses this evaluation gap and provides a practical benchmark for assessing AI systems in real-world scientific workflows.
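As a rough illustration of the per-capability evaluation the abstract describes, the sketch below scores model predictions separately for each of the five tracks. Everything here is an assumption: the page does not publish PolyReal's actual data schema or metrics, so `BenchmarkItem`, `per_capability_accuracy`, the field names, and exact-match scoring are hypothetical stand-ins, not the paper's real format.

```python
from dataclasses import dataclass, field

# Hypothetical track names mirroring the five capabilities listed in the
# abstract; PolyReal's actual identifiers are not given on this page.
CAPABILITIES = (
    "foundational_knowledge",
    "lab_safety_analysis",
    "experiment_mechanism_reasoning",
    "raw_data_extraction",
    "performance_application",
)

@dataclass
class BenchmarkItem:
    """One multimodal question: a text prompt plus optional image inputs."""
    capability: str                                        # one of CAPABILITIES
    question: str                                          # text portion of the prompt
    image_paths: list[str] = field(default_factory=list)  # figures, photos, traces
    answer: str = ""                                       # gold reference answer

def per_capability_accuracy(items, predictions):
    """Exact-match accuracy per capability track (illustrative metric only)."""
    correct, total = {}, {}
    for item, pred in zip(items, predictions):
        total[item.capability] = total.get(item.capability, 0) + 1
        if pred.strip().lower() == item.answer.strip().lower():
            correct[item.capability] = correct.get(item.capability, 0) + 1
    return {cap: correct.get(cap, 0) / n for cap, n in total.items()}

# Toy example with two items from different tracks.
items = [
    BenchmarkItem("lab_safety_analysis", "Which glove material resists THF?", [], "butyl"),
    BenchmarkItem("raw_data_extraction", "Read the Tg from the DSC trace.", ["dsc_curve.png"], "105 C"),
]
print(per_capability_accuracy(items, ["butyl", "98 C"]))
# -> {'lab_safety_analysis': 1.0, 'raw_data_extraction': 0.0}
```

In practice, open-ended tracks such as mechanism reasoning would likely require rubric- or model-based grading rather than exact string match; the sketch only shows how scores could be broken out per capability to surface the imbalance the abstract reports.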