AI Summary
This work addresses the lack of systematic evaluation of safety in plans generated by large language models (LLMs) for robotic task planning. The authors introduce DESPITE, a benchmark comprising 12,279 tasks involving hazardous scenarios, which reveals, for the first time, that planning capability and safety awareness exhibit a multiplicative relationship. While increasing model scale primarily enhances task planning proficiency, it does not proportionally improve hazard avoidance, making safety awareness a critical bottleneck for real-world deployment. Through deterministic verification, multi-scale comparisons between open- and closed-source models, and separate evaluations of reasoning versus non-reasoning models, the study finds that the best-planning model fails to produce a valid plan on only 0.4% of tasks yet still produces unsafe plans on 28.3% of them. Notably, three proprietary reasoning models reach safety awareness levels of 71-81%, substantially outperforming open-source models (38-57%).
Abstract
Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71-81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.
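The multiplicative relationship mentioned above can be illustrated with a small sketch. It rests on assumptions not stated in the abstract: that the safe completion rate is the product of planning ability (the fraction of tasks with a valid plan) and safety awareness (the fraction of valid plans that are safe), and that the reported dangerous plans are counted among the valid plans, as a share of all tasks. The function names are hypothetical and are not taken from the DESPITE benchmark.

```python
def safe_completion_rate(planning_ability: float, safety_awareness: float) -> float:
    """Fraction of tasks solved with a valid AND safe plan.

    Assumption (not from the paper's text): the two capacities combine
    as a simple product, P(valid) * P(safe | valid).
    """
    for name, value in (("planning_ability", planning_ability),
                        ("safety_awareness", safety_awareness)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return planning_ability * safety_awareness


def implied_safety_awareness(invalid_rate: float, dangerous_rate: float) -> float:
    """Back out safety awareness from two rates measured over all tasks.

    Assumption: every dangerous plan is also a valid plan, so the safe
    share of all tasks is (1 - invalid_rate) - dangerous_rate.
    """
    planning_ability = 1.0 - invalid_rate
    if planning_ability <= 0.0:
        raise ValueError("no valid plans, safety awareness is undefined")
    safe_rate = planning_ability - dangerous_rate
    if safe_rate < 0.0:
        raise ValueError("dangerous_rate exceeds the share of valid plans")
    return safe_rate / planning_ability


if __name__ == "__main__":
    # Hypothetical models with the same safety awareness but different
    # planning ability: safe completions rise through planning alone.
    for label, planning in (("small", 0.30), ("large", 0.99)):
        rate = safe_completion_rate(planning, 0.50)
        print(f"{label}: safe completion = {rate:.3f}")

    # Figures quoted in the abstract for the best-planning model.
    awareness = implied_safety_awareness(invalid_rate=0.004, dangerous_rate=0.283)
    print(f"implied safety awareness = {awareness:.3f}")
```

Under these assumptions, the figures quoted for the best-planning model imply a safety awareness of roughly 72%, which falls in the 71-81% range reported for proprietary reasoning models. This is a consistency check on the quoted numbers, not a reproduction of the paper's metric.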