🤖 AI Summary
This work addresses the limited executability and generalization of large language models (LLMs) on complex real-world planning tasks such as meeting scheduling. To overcome the poor robustness of conventional autoregressive plan generation, we study *programming-oriented planning*: compiling natural-language instructions into executable Python or constraint satisfaction problem (CSP) solver code, and we additionally examine models that scale their reasoning length with task complexity at inference time. We present the first systematic evaluation of both open- and closed-source LLMs under this paradigm, showing that the programming route often, though not always, substantially outperforms autoregressive planning. A fine-grained error analysis identifies insufficient robustness and efficiency in the generated code as the primary bottleneck for generalization.
📝 Abstract
Real-life textual planning tasks such as meeting scheduling pose a significant challenge to LLMs, especially when the complexity is high. While previous work primarily studied autoregressive generation of plans with closed-source models, we systematically evaluate both closed- and open-source models, including those that scale output length with task complexity during inference, on generating programs that are executed to produce the plan. We consider not only standard Python code but also code for a constraint satisfaction problem (CSP) solver. Despite the algorithmic nature of the task, we show that programming often, but not always, outperforms direct planning. Our detailed error analysis also indicates a lack of robustness and efficiency in the generated code that hinders generalization.
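To make the programming-oriented paradigm concrete, here is a minimal sketch of the kind of program an LLM might emit for a meeting-scheduling instruction and then execute to produce the plan. The function name, schedule data, and brute-force search are illustrative assumptions, not code from the paper.

```python
# Hypothetical example of LLM-generated scheduling code (illustrative only).
# Each participant has a list of busy intervals (start_hour, end_hour);
# the program searches for the earliest whole-hour slot free for everyone.

def find_slot(busy, day_start=9, day_end=17, duration=1):
    """Return the earliest (start, end) hour slot with no conflicts, or None."""
    for start in range(day_start, day_end - duration + 1):
        end = start + duration
        # A candidate slot is feasible if it overlaps no busy interval
        # of any participant.
        if all(end <= b_start or start >= b_end
               for intervals in busy.values()
               for (b_start, b_end) in intervals):
            return (start, end)
    return None

busy = {
    "alice": [(9, 10), (12, 13)],  # hypothetical calendars
    "bob":   [(10, 11)],
}
print(find_slot(busy))  # → (11, 12)
```

A CSP-solver variant would instead declare the slot as a decision variable and the non-overlap conditions as constraints, delegating the search to the solver; the paper evaluates both styles of generated code.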