🤖 AI Summary
This work addresses the high deployment cost of traditional neuro-symbolic task planning, which relies on handcrafted relaxation and complementary rules and on hundreds of solved training problems to supervise a graph neural network (GNN) object scorer. The authors propose LLM-Flax, a three-stage framework that requires only a PDDL domain file and uses an off-the-shelf large language model (LLM) to generate planning rules, guide failure recovery, and score object importance zero-shot, with no manual rule authoring or training data. Key components include structured prompting with format validation and self-correction, LLM-guided failure recovery, and a feasibility-gated budget policy that reserves the latency cost of each LLM call so the relaxation fallback is not starved. The method achieves an average success rate (SR) of 0.945 across all eight MazeNamo benchmarks, compared with 0.828 for the human-authored baseline. The largest gain is on 12×12 Expert, where SR rises from 0.000 to 0.733, and on 15×15 Hard the method reaches a perfect SR of 1.000.
📝 Abstract
Deploying a neuro-symbolic task planner on a new domain today requires significant manual effort: a domain expert must author relaxation and complementary rules, and hundreds of training problems must be solved to supervise a Graph Neural Network (GNN) object scorer. We propose LLM-Flax, a three-stage framework that eliminates all three sources of manual effort using a locally hosted LLM given only a PDDL domain file. Stage 1 automatically generates relaxation and complementary rules via structured prompting with format validation and self-correction. Stage 2 introduces LLM-guided failure recovery with a feasibility-gated budget policy that explicitly reserves the latency cost of each LLM call before making it, preventing the downstream relaxation fallback from being starved. Stage 3 replaces the domain-trained GNN entirely with zero-shot LLM object importance scoring, requiring no training data. We evaluate all three stages on the MazeNamo benchmark across 10×10, 12×12, and 15×15 grids (8 benchmarks total). LLM-Flax achieves an average success rate (SR) of 0.945 versus the manual baseline's 0.828 (+0.117), matching or outperforming manual rules on every one of the eight benchmarks. On 12×12 Expert, LLM-Flax attains SR 0.733 where the manual planner fails entirely (SR 0.000); on 15×15 Hard, it achieves SR 1.000 versus the manual baseline's 0.900. Stage 3 demonstrates feasibility (SR 0.720 on 12×12 Hard with no training data) but faces a context-window bottleneck at scale, which is the primary open challenge for future work.
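The feasibility-gated budget policy of Stage 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `recover_from_failure`, the callables `ask_llm` and `relaxation_fallback`, the fixed per-call latency estimate, and the `max_llm_calls` cap are all hypothetical and introduced only for illustration. The sketch shows the one idea stated in the abstract: an LLM call is made only if, after its latency is deducted, enough time remains for the relaxation fallback.

```python
from typing import Callable, List, Optional, Tuple


def recover_from_failure(
    remaining_s: float,
    llm_latency_s: float,
    fallback_reserve_s: float,
    ask_llm: Callable[[], Optional[str]],
    relaxation_fallback: Callable[[float], Optional[str]],
    max_llm_calls: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Hypothetical feasibility-gated recovery loop.

    remaining_s        -- planning time still available, in seconds
    llm_latency_s      -- assumed (fixed) latency estimate for one LLM call
    fallback_reserve_s -- time that must be left over for the fallback
    Returns the plan (or None) and a trace of the steps taken.
    """
    trace: List[str] = []
    for _ in range(max_llm_calls):
        # Gate: skip the call if paying for it would eat into the
        # time reserved for the relaxation fallback.
        if remaining_s - llm_latency_s < fallback_reserve_s:
            trace.append("gate_closed")
            break
        # Deduct the latency cost before making the call.
        remaining_s -= llm_latency_s
        trace.append("llm")
        plan = ask_llm()
        if plan is not None:
            return plan, trace
    # The fallback always runs with at least the time the gate protected,
    # provided the initial budget was at least fallback_reserve_s.
    trace.append("fallback")
    return relaxation_fallback(remaining_s), trace


if __name__ == "__main__":
    # Toy run: the LLM never finds a fix, so the fallback takes over.
    plan, trace = recover_from_failure(
        remaining_s=10.0,
        llm_latency_s=4.0,
        fallback_reserve_s=3.0,
        ask_llm=lambda: None,
        relaxation_fallback=lambda t: f"fallback plan ({t:.1f}s left)",
    )
    print(plan, trace)
```

In the toy run, one LLM call fits in the budget (10 s minus 4 s leaves 6 s, above the 3 s reserve), but a second call would leave only 2 s, so the gate closes and the fallback runs with 6 s. A real system would presumably measure latency rather than assume a constant.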