🤖 AI Summary
This study addresses a critical gap: the unreliable safety-risk identification and decision-making of current multimodal large language models (MLLMs) in high-stakes scientific experimentation. To this end, the authors introduce LABSHIELD, the first multi-perspective safety evaluation benchmark grounded in OSHA and GHS standards, comprising 164 experimental tasks of varying manipulation complexity and risk level. They propose a dual-track assessment framework combining multiple-choice and semi-open-ended question answering. A systematic evaluation of 20 closed-source, 9 open-source, and 3 embodied models reveals an average 32.0% drop from multiple-choice accuracy to semi-open safety performance in professional laboratory settings, underscoring significant deficiencies in hazard interpretation and safety planning. The benchmark establishes a foundational resource for advancing safety-critical multimodal agents.
📝 Abstract
Artificial intelligence is increasingly catalyzing scientific automation, with multimodal large language model (MLLM) agents evolving from lab assistants into self-driving lab operators. This transition imposes stringent safety requirements: in laboratory environments, fragile glassware, hazardous substances, and high-precision equipment render planning errors or misinterpreted risks potentially irreversible. However, the safety awareness and decision-making reliability of embodied agents in such high-stakes settings remain insufficiently defined and evaluated. To bridge this gap, we introduce LABSHIELD, a realistic multi-view benchmark designed to assess MLLMs in hazard identification and safety-critical reasoning. Grounded in U.S. Occupational Safety and Health Administration (OSHA) standards and the Globally Harmonized System (GHS), LABSHIELD establishes a rigorous safety taxonomy spanning 164 operational tasks with diverse manipulation complexities and risk profiles. We evaluate 20 proprietary, 9 open-source, and 3 embodied models under a dual-track evaluation framework. Our results reveal a systematic gap between general-domain MCQ accuracy and semi-open QA safety performance, with models exhibiting an average drop of 32.0% in professional laboratory scenarios, particularly in hazard interpretation and safety-aware planning. These findings underscore the urgent need for safety-centric reasoning frameworks to enable reliable autonomous experimentation in embodied laboratory contexts. The full dataset will be released soon.
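As a rough illustration of the dual-track gap described above, the sketch below computes a per-model drop from MCQ accuracy to semi-open QA safety score and averages it across models. The abstract does not specify whether the 32.0% figure is an absolute or relative drop; this sketch assumes a relative drop, and all model names and scores are hypothetical placeholders, not LABSHIELD results.

```python
# Hypothetical sketch of the dual-track gap computation. Model names and
# scores below are illustrative placeholders, not actual LABSHIELD data.

def relative_drop(mcq_acc: float, semi_open_score: float) -> float:
    """Relative performance drop (%) from MCQ accuracy to semi-open QA score."""
    return (mcq_acc - semi_open_score) / mcq_acc * 100.0

# Placeholder per-model results: (MCQ accuracy, semi-open QA safety score).
results = {
    "model_a": (0.85, 0.55),
    "model_b": (0.78, 0.57),
    "model_c": (0.90, 0.61),
}

drops = {name: relative_drop(mcq, semi) for name, (mcq, semi) in results.items()}
avg_drop = sum(drops.values()) / len(drops)

for name, d in drops.items():
    print(f"{name}: {d:.1f}% drop")
print(f"average drop: {avg_drop:.1f}%")
```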