AI Summary
This work addresses the safety challenges in embodied task planning under partial observability and physical constraints, a domain where existing benchmarks lack systematic evaluation of plan feasibility and safety. We propose the first safety-oriented benchmark for embodied task planning, which uniquely integrates strict partial observability, both explicit and implicit physical constraints, and diverse household hazard scenarios into a unified evaluation framework. The benchmark introduces state- and constraint-based online metrics and incorporates a goal-conditioned, step-by-step planning mechanism to enable fine-grained assessment of large language models' safety-aware planning capabilities. Experimental results reveal that current state-of-the-art models struggle to ensure safety under implicit constraints, highlighting their limitations in real-world deployment.
Abstract
Embodied task planning with large language models (LLMs) faces safety challenges in real-world environments, where partial observability and physical constraints must be respected. Existing benchmarks often overlook these critical factors, limiting their ability to evaluate both feasibility and safety. We introduce SPOC, a benchmark for safety-aware embodied task planning, which integrates strict partial observability, physical constraints, step-by-step planning, and goal-condition-based evaluation. Covering diverse household hazards such as fire, fluid, injury, object damage, and pollution, SPOC enables rigorous assessment through both state-based and constraint-based online metrics. Experiments with state-of-the-art LLMs reveal that current models struggle to ensure safety-aware planning, particularly under implicit constraints. Code and dataset are available at https://github.com/khm159/SPOC