🤖 AI Summary
Large language models (LLMs) exhibit weak reasoning and low success rates on multi-constraint real-world planning tasks—e.g., travel planning—because they cannot guarantee logical consistency or constraint satisfaction.
Method: We propose the first LLM–formal verification co-planning framework: natural language planning requests are automatically compiled into SMT (Satisfiability Modulo Theories) instances and solved by sound and complete solvers (e.g., Z3); a novel closed-loop diagnostic mechanism supports unsatisfiable-core extraction, failure attribution, and prompt rewriting. The method integrates NL-to-logic translation, declarative constraint modeling, and adaptive prompt refinement, yielding strong zero-shot cross-domain generalization.
Contribution/Results: On the TravelPlanner benchmark, the planning success rate rises from 10% to 93.9%; correction rates for unsatisfiable queries reach 81.6% and 91.7% under two evaluation settings. The framework generalizes robustly to unseen international travel scenarios and to novel domains without task-specific fine-tuning.
📝 Abstract
Large Language Models (LLMs) struggle to directly generate correct plans for complex multi-constraint planning problems, even with self-verification and self-critique. For example, on the U.S. domestic travel planning benchmark TravelPlanner proposed in Xie et al. (2024), the best LLM, OpenAI o1-preview, finds viable travel plans with only a 10% success rate even when given all needed information. In this work, we tackle this by proposing an LLM-based planning framework that formalizes complex multi-constraint planning problems as constraint satisfiability problems, which are then solved by sound and complete satisfiability solvers. We start with TravelPlanner as the primary use case and show that our framework achieves a success rate of 93.9% and is effective with diverse paraphrased prompts. More importantly, our framework has strong zero-shot generalizability, successfully handling unseen constraints in our newly created unseen international travel dataset and generalizing well to new, fundamentally different domains. Moreover, when user input queries are infeasible, our framework can identify the unsatisfiable core, provide failure reasons, and offer personalized modification suggestions. We show that our framework can modify and solve an average of 81.6% and 91.7% of unsatisfiable queries from the two datasets, and we show through ablations that all key components of our framework are effective and necessary. Project page: https://sites.google.com/view/llm-rwplanning.