Code Driven Planning with Domain-Adaptive Critic

📅 2025-09-23
🤖 AI Summary
When large language models (LLMs) serve as planners for AI agents, the mismatch between their general knowledge and task-specific environmental requirements leads to inaccurate plans. This paper proposes an efficient long-horizon planning framework to address that mismatch. It replaces frequent LLM queries with executable, high-level planning programs written as code, which significantly reduces query costs. A trainable, domain-adaptive critic evaluates the candidate plans these programs produce and selects the one most aligned with long-term rewards, mitigating the limitations of short-term feedback. The core innovation lies in decoupling planning into two phases, program generation and critic-based evaluation, and in training the critic to adapt to the environment. Evaluated on ALFWorld, NetHack, and StarCraft II Unit Building, the framework achieves an average 23.33% improvement in success rate and a 91.27% reduction in query costs relative to the AdaPlanner and Reflexion baselines.

📝 Abstract
Large Language Models (LLMs) have been widely adopted as task planners for AI agents in sequential decision-making problems, leveraging their extensive world knowledge. However, the gap between their general knowledge and environment-specific requirements often leads to inaccurate plans. To address this, existing approaches rely on frequent LLM queries to iteratively refine plans based on immediate environmental feedback, which incurs substantial query costs. Moreover, this refinement is typically guided by short-term environmental feedback, which prevents LLMs from developing plans aligned with long-term rewards. We propose Code Driven Planning with Domain-Adaptive Critic (CoPiC). Instead of relying on frequent queries, CoPiC employs LLMs to generate a diverse set of high-level planning programs, which iteratively produce and refine candidate plans. A trained domain-adaptive critic then evaluates these candidates and selects the one most aligned with long-term rewards for execution. Using high-level planning programs as the planner and a domain-adaptive critic as the estimator, CoPiC improves planning while significantly reducing query costs. Results in ALFWorld, NetHack, and StarCraft II Unit Building show that CoPiC outperforms advanced LLM-based baselines, AdaPlanner and Reflexion, achieving an average (1) 23.33% improvement in success rate and (2) 91.27% reduction in query costs.
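
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes each planning program is an ordinary callable that maps an observation to a candidate plan, and that the critic is a callable returning a scalar estimate of long-term value. The names `select_plan`, `program_a`, `program_b`, and `toy_critic` are hypothetical and do not come from the paper; the toy critic is hand-written here, whereas the paper's critic is trained.

```python
from typing import Callable, List, Sequence, Tuple

Plan = List[str]
PlanningProgram = Callable[[str], Plan]   # observation -> candidate plan
Critic = Callable[[str, Plan], float]     # (observation, plan) -> estimated long-term value


def select_plan(observation: str,
                programs: Sequence[PlanningProgram],
                critic: Critic) -> Tuple[Plan, float]:
    """Run every planning program once and return the highest-scoring candidate."""
    if not programs:
        raise ValueError("at least one planning program is required")
    candidates = [program(observation) for program in programs]
    scores = [critic(observation, plan) for plan in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


# Hypothetical stand-ins for LLM-generated planning programs.
def program_a(obs: str) -> Plan:
    return ["go to fridge", "take apple"]


def program_b(obs: str) -> Plan:
    return ["go to fridge", "open fridge", "take apple"]


# Hypothetical stand-in for a trained critic: it prefers plans that
# open the fridge before taking anything out of it.
def toy_critic(obs: str, plan: Plan) -> float:
    return 1.0 if "open fridge" in plan else 0.2


plan, score = select_plan("apple is in the closed fridge",
                          [program_a, program_b], toy_critic)
print(plan, score)
```

Note that `select_plan` makes no LLM call: in this sketch the LLM would only be needed to write the planning programs up front, which is the source of the query savings the abstract reports.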

Problem

Research questions and friction points this paper is trying to address.

Bridging the gap between LLMs' general knowledge and environment-specific planning requirements
Reducing substantial query costs from frequent LLM interactions in iterative planning
Improving plan quality by aligning decisions with long-term rewards rather than short-term feedback

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates diverse high-level planning programs
Uses domain-adaptive critic for long-term reward evaluation
Reduces LLM queries by letting the planning programs, rather than repeated LLM calls, iteratively produce and refine candidate plans
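
To illustrate why scoring by long-term reward can rank plans differently from immediate feedback, here is a small sketch using the standard discounted return. The source does not state how CoPiC quantifies long-term reward, so the discounted-return target, the `gamma` value, and the reward sequences below are all assumptions chosen for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.5) -> float:
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical per-step rewards for two candidate plans.
greedy_plan_rewards = [1.0, 0.0, 0.0]    # pays off immediately, then nothing
patient_plan_rewards = [0.0, 0.0, 5.0]   # pays off only at the end

# Immediate feedback prefers the greedy plan...
myopic_choice = "greedy" if greedy_plan_rewards[0] > patient_plan_rewards[0] else "patient"

# ...while the discounted return prefers the patient plan.
greedy_value = discounted_return(greedy_plan_rewards)
patient_value = discounted_return(patient_plan_rewards)
long_term_choice = "greedy" if greedy_value > patient_value else "patient"

print(myopic_choice, long_term_choice)
```

A critic trained to predict a quantity of this kind would select the second plan even though its first step earns nothing.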
Zikang Tian
SKL of Processors, Institute of Computing Technology, CAS, Beijing, China
Shaohui Peng
Institute of Software Chinese Academy of Sciences
Embodied AI, Reinforcement Learning
Di Huang
SKL of Processors, Institute of Computing Technology, CAS, Beijing, China
Jiaming Guo
Institute of Computing Technology, Chinese Academy of Sciences
Artificial Intelligence, Reinforcement Learning
Ruizhi Chen
SKL of Computer Science, Institute of Software, CAS, Beijing, China
Rui Zhang
SKL of Processors, Institute of Computing Technology, CAS, Beijing, China
Xishan Zhang
Institute of Computing Technology of the Chinese Academy of Sciences
Yuxuan Guo
SKL of Processors, Institute of Computing Technology, CAS, Beijing, China
Zidong Du
Shanghai Innovation Center for Processor Technologies, SHIC, Shanghai, China
Qi Guo
SKL of Processors, Institute of Computing Technology, CAS, Beijing, China
Ling Li
University of Chinese Academy of Sciences, Beijing, China
Yewen Pu
Nanyang Technological University
code generation, instruction following, cognitive science, dataset curation
Xing Hu
Shanghai Innovation Center for Processor Technologies, SHIC, Shanghai, China
Yunji Chen
Institute of Computing Technology, Chinese Academy of Sciences
processor architecture, microarchitecture, machine learning