🤖 AI Summary
This work addresses the limited executability and generalization of large language models (LLMs) on complex real-world planning tasks such as meeting scheduling. To overcome the poor robustness of conventional autoregressive plan generation, we study *programming-oriented planning*: compiling natural-language instructions into executable Python or constraint satisfaction problem (CSP) solver code, and we additionally examine models that scale their reasoning length with task complexity at inference time. We present the first systematic evaluation of both open- and closed-source LLMs under this paradigm, showing that the programming route often, though not always, substantially outperforms autoregressive planning. A fine-grained error analysis identifies insufficient robustness and efficiency in the generated code as the primary bottleneck for generalization.
📝 Abstract
Real-life textual planning tasks such as meeting scheduling pose a significant challenge to LLMs, especially when the complexity is high. While previous work primarily studied autoregressive generation of plans with closed-source models, we systematically evaluate both closed- and open-source models, including those that scale output length with task complexity during inference, on generating programs that are executed to produce the plan. We consider not only standard Python code but also code for a constraint satisfaction problem (CSP) solver. Despite the algorithmic nature of the task, we show that programming often, but not always, outperforms direct planning. Our detailed error analysis also indicates a lack of robustness and efficiency in the generated code that hinders generalization.
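To make the programming-oriented paradigm concrete, here is a minimal sketch of the kind of program an LLM might emit for a meeting-scheduling instruction and then execute to produce the plan. The function name, schedule data, and brute-force search are illustrative assumptions, not code from the paper.

```python
# Hypothetical example of LLM-generated scheduling code (illustrative only).
# Each participant has a list of busy intervals (start_hour, end_hour);
# the program searches for the earliest whole-hour slot free for everyone.

def find_slot(busy, day_start=9, day_end=17, duration=1):
    """Return the earliest (start, end) hour slot with no conflicts, or None."""
    for start in range(day_start, day_end - duration + 1):
        end = start + duration
        # A candidate slot is feasible if it overlaps no busy interval
        # of any participant.
        if all(end <= b_start or start >= b_end
               for intervals in busy.values()
               for (b_start, b_end) in intervals):
            return (start, end)
    return None

busy = {
    "alice": [(9, 10), (12, 13)],  # hypothetical calendars
    "bob":   [(10, 11)],
}
print(find_slot(busy))  # → (11, 12)
```

A CSP-solver variant would instead declare the slot as a decision variable and the non-overlap conditions as constraints, delegating the search to the solver; the paper evaluates both styles of generated code.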