🤖 AI Summary
This study addresses a critical gap: the unreliable safety-risk identification and decision-making of current multimodal large language models (MLLMs) in high-stakes scientific experimentation. To this end, the authors introduce LABSHIELD, the first multi-perspective safety evaluation benchmark grounded in OSHA and GHS standards, comprising 164 experimental tasks of varying manipulation complexity and risk level. They propose a dual-track assessment framework combining multiple-choice and semi-open-ended question answering. A systematic evaluation of 20 closed-source, 9 open-source, and 3 embodied models reveals an average 32.0% drop from multiple-choice accuracy to semi-open safety performance in professional laboratory settings, underscoring significant deficiencies in hazard interpretation and safety planning. The benchmark establishes a foundational resource for advancing safety-critical multimodal agents.
📝 Abstract
Artificial intelligence is increasingly catalyzing scientific automation, with multimodal large language model (MLLM) agents evolving from lab assistants into self-driving lab operators. This transition imposes stringent safety requirements: in laboratory environments, fragile glassware, hazardous substances, and high-precision equipment render planning errors or misinterpreted risks potentially irreversible. However, the safety awareness and decision-making reliability of embodied agents in such high-stakes settings remain insufficiently defined and evaluated. To bridge this gap, we introduce LABSHIELD, a realistic multi-view benchmark designed to assess MLLMs in hazard identification and safety-critical reasoning. Grounded in U.S. Occupational Safety and Health Administration (OSHA) standards and the Globally Harmonized System (GHS), LABSHIELD establishes a rigorous safety taxonomy spanning 164 operational tasks with diverse manipulation complexities and risk profiles. We evaluate 20 proprietary, 9 open-source, and 3 embodied models under a dual-track evaluation framework. Our results reveal a systematic gap between general-domain MCQ accuracy and semi-open QA safety performance, with models exhibiting an average drop of 32.0% in professional laboratory scenarios, particularly in hazard interpretation and safety-aware planning. These findings underscore the urgent need for safety-centric reasoning frameworks to enable reliable autonomous experimentation in embodied laboratory contexts. The full dataset will be released soon.
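As a rough illustration of the dual-track gap described above, the sketch below computes a per-model drop from MCQ accuracy to semi-open QA safety score and averages it across models. The abstract does not specify whether the 32.0% figure is an absolute or relative drop; this sketch assumes a relative drop, and all model names and scores are hypothetical placeholders, not LABSHIELD results.

```python
# Hypothetical sketch of the dual-track gap computation. Model names and
# scores below are illustrative placeholders, not actual LABSHIELD data.

def relative_drop(mcq_acc: float, semi_open_score: float) -> float:
    """Relative performance drop (%) from MCQ accuracy to semi-open QA score."""
    return (mcq_acc - semi_open_score) / mcq_acc * 100.0

# Placeholder per-model results: (MCQ accuracy, semi-open QA safety score).
results = {
    "model_a": (0.85, 0.55),
    "model_b": (0.78, 0.57),
    "model_c": (0.90, 0.61),
}

drops = {name: relative_drop(mcq, semi) for name, (mcq, semi) in results.items()}
avg_drop = sum(drops.values()) / len(drops)

for name, d in drops.items():
    print(f"{name}: {d:.1f}% drop")
print(f"average drop: {avg_drop:.1f}%")
```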