SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in evaluating the safety of multimodal large language models (MLLMs): current assessments are confined to static question answering and overlook whether a model can proactively mitigate real-world hazards during embodied interaction. To bridge this gap, the authors introduce SafetyALFRED, an evaluation framework built on the ALFRED benchmark and augmented with six categories of realistic kitchen safety hazards. Safety assessment is extended beyond hazard recognition to include embodied risk-mitigation behavior. Evaluations of eleven models from the Qwen, Gemma, and Gemini families show that although the models accurately identify hazards in QA settings, their success rate at carrying out effective safety interventions during tasks is markedly lower. This discrepancy reveals a significant alignment gap between perception and action and motivates a shift in safety evaluation toward corrective, embodied responses.

📝 Abstract

Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED and augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition in disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families not only on hazard recognition but also on active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, their average mitigation success rates for the same hazards are low in comparison. Our findings demonstrate that static QA evaluations are insufficient for physical safety; we therefore advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset at https://github.com/sled-group/SafetyALFRED.git
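
One way to picture the comparison between QA recognition and embodied mitigation is as two rates computed over the same hazard episodes, with their difference taken as the gap. The sketch below is illustrative only and is not taken from the SafetyALFRED code: the `Episode` record, the `alignment_gap` function, the placeholder category labels, and the toy episodes are all assumptions, and the numbers are not results from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Episode:
    """One hazard instance (hypothetical record layout, not the paper's schema)."""
    hazard_category: str  # placeholder label; the actual six categories are not listed here
    recognized: bool      # model identified the hazard in the QA setting
    mitigated: bool       # model's embodied plan actually removed the hazard


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def alignment_gap(episodes: List[Episode]) -> Tuple[float, float, float]:
    """Return (recognition rate, mitigation rate, recognition minus mitigation)."""
    recognition = rate(e.recognized for e in episodes)
    mitigation = rate(e.mitigated for e in episodes)
    return recognition, mitigation, recognition - mitigation


# Toy data, made up purely to exercise the function.
episodes = [
    Episode("placeholder_a", recognized=True, mitigated=False),
    Episode("placeholder_a", recognized=True, mitigated=True),
    Episode("placeholder_b", recognized=True, mitigated=False),
    Episode("placeholder_b", recognized=False, mitigated=False),
]

recognition, mitigation, gap = alignment_gap(episodes)
print(f"recognition={recognition:.2f} mitigation={mitigation:.2f} gap={gap:.2f}")
```

A positive gap under this definition means the model flags hazards more often than it acts on them, which is the pattern the abstract describes.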

Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models

Safety Evaluation

Embodied Planning

Hazard Mitigation

Interactive Environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Embodied Planning

Safety Evaluation

Multimodal Large Language Models