🤖 AI Summary
Existing work on translating natural language descriptions of planning tasks into structured planning languages such as PDDL lacks rigorous semantic-correctness evaluation and relies on simple or unrealistic datasets. Method: The authors introduce *Planetarium*, a benchmark featuring (i) a novel PDDL equivalence algorithm that flexibly checks the semantic correctness of generated problems; (ii) a dataset of 145,918 text-to-PDDL pairs spanning 73 unique state combinations at varying levels of difficulty; and (iii) an evaluation of several API-access and open-weight language models, including GPT-4o. Contribution/Results: Even a strong model like GPT-4o produces PDDL that is usually parseable (96.1%) and solvable (94.4%) yet only 24.8% semantically correct, underscoring the task's difficulty. *Planetarium* establishes a reproducible, semantics-driven evaluation standard for natural-language-to-PDDL translation.
📝 Abstract
Recent works have explored using language models for planning problems. One approach examines translating natural language descriptions of planning tasks into structured planning languages, such as the planning domain definition language (PDDL). Existing evaluation methods struggle to ensure semantic correctness and rely on simple or unrealistic datasets. To bridge this gap, we introduce *Planetarium*, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. *Planetarium* features a novel PDDL equivalence algorithm that flexibly evaluates the correctness of generated PDDL, along with a dataset of 145,918 text-to-PDDL pairs across 73 unique state combinations with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models that reveal this task's complexity. For example, 96.1% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable, 94.4% are solvable, but only 24.8% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.
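The gap between "parseable" and "semantically correct" comes from the fact that two PDDL problems can be textually different yet describe the same planning task. A toy sketch of this distinction (this is NOT the paper's equivalence algorithm, which is far more flexible; the goal strings and helper are illustrative only) compares flat goal conjunctions as sets of ground atoms rather than as raw strings:

```python
import re

def goal_atoms(goal: str) -> frozenset:
    """Extract ground atoms such as '(on a b)' from a flat '(and ...)' goal.

    Toy helper for illustration; real PDDL goals can nest and quantify,
    which this deliberately does not handle.
    """
    inner = goal.strip()
    if inner.startswith("(and") and inner.endswith(")"):
        inner = inner[len("(and"):-1]  # drop the wrapping (and ... )
    return frozenset(re.findall(r"\([\w-]+(?:\s+[\w-]+)*\)", inner))

# Two Blocksworld-style goals: same meaning, different surface form.
g1 = "(and (on a b) (on b c) (clear a))"
g2 = "(and (clear a) (on b c) (on a b))"

print(g1 == g2)                        # string comparison: different
print(goal_atoms(g1) == goal_atoms(g2))  # atom-set comparison: equivalent
```

Even this set-based view is too weak in general (e.g. logically equivalent but syntactically distinct goals, or implicit initial-state facts), which is why a dedicated equivalence algorithm is needed.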