🤖 AI Summary
Existing work on translating natural language descriptions of planning tasks into structured planning languages such as PDDL lacks rigorous semantic-correctness evaluation and relies on simple or unrealistic datasets. Method: The authors introduce *Planetarium*, a benchmark featuring (i) a novel PDDL equivalence algorithm that flexibly checks the semantic correctness of generated problems; (ii) a dataset of 145,918 text-to-PDDL pairs spanning 73 unique state combinations at varying levels of difficulty; and (iii) an evaluation of several API-access and open-weight language models, including GPT-4o. Contribution/Results: Even a strong model like GPT-4o produces PDDL that is usually parseable (96.1%) and solvable (94.4%) yet only 24.8% semantically correct, underscoring the task's difficulty. *Planetarium* establishes a reproducible, semantics-driven evaluation standard for natural-language-to-PDDL translation.
📝 Abstract
Recent works have explored using language models for planning problems. One approach examines translating natural language descriptions of planning tasks into structured planning languages, such as the planning domain definition language (PDDL). Existing evaluation methods struggle to ensure semantic correctness and rely on simple or unrealistic datasets. To bridge this gap, we introduce *Planetarium*, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. *Planetarium* features a novel PDDL equivalence algorithm that flexibly evaluates the correctness of generated PDDL, along with a dataset of 145,918 text-to-PDDL pairs across 73 unique state combinations with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models that reveal this task's complexity. For example, 96.1% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable, 94.4% are solvable, but only 24.8% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.
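The gap between "parseable" and "semantically correct" comes from the fact that two PDDL problems can be textually different yet describe the same planning task. A toy sketch of this distinction (this is NOT the paper's equivalence algorithm, which is far more flexible; the goal strings and helper are illustrative only) compares flat goal conjunctions as sets of ground atoms rather than as raw strings:

```python
import re

def goal_atoms(goal: str) -> frozenset:
    """Extract ground atoms such as '(on a b)' from a flat '(and ...)' goal.

    Toy helper for illustration; real PDDL goals can nest and quantify,
    which this deliberately does not handle.
    """
    inner = goal.strip()
    if inner.startswith("(and") and inner.endswith(")"):
        inner = inner[len("(and"):-1]  # drop the wrapping (and ... )
    return frozenset(re.findall(r"\([\w-]+(?:\s+[\w-]+)*\)", inner))

# Two Blocksworld-style goals: same meaning, different surface form.
g1 = "(and (on a b) (on b c) (clear a))"
g2 = "(and (clear a) (on b c) (on a b))"

print(g1 == g2)                        # string comparison: different
print(goal_atoms(g1) == goal_atoms(g2))  # atom-set comparison: equivalent
```

Even this set-based view is too weak in general (e.g. logically equivalent but syntactically distinct goals, or implicit initial-state facts), which is why a dedicated equivalence algorithm is needed.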