π€ AI Summary
Existing LLM-based agents rely on natural-language planning, resulting in verbose reasoning and poor generalization. Method: We propose P-code Planningβa structured, pseudocode-style planning formalism that replaces unstructured text with executable logical constructs to enhance plan interpretability and cross-task transferability. We further introduce Planning-Guided Preference Optimization (PGPO), the first framework to integrate pseudocode planning into LLM reasoning, employing a dual-objective reward function that decouples planning quality from action execution optimization. Our approach combines pseudocode generation, multi-objective reward modeling, preference-based reinforcement learning (a PPO variant), and structured chain-of-thought distillation. Results: On mainstream agent benchmarks, our method significantly outperforms SOTA: action error rate and omission rate decrease by 23.6% and 18.4%, respectively; planning efficiency improves by 37%; and generalization across unseen tasks is substantially enhanced.
π Abstract
Large Language Model (LLM) agents have demonstrated impressive capabilities in handling complex interactive problems. Existing LLM agents mainly generate natural language plans to guide reasoning, which is verbose and inefficient. NL plans are also tailored to specific tasks and restrict agents' ability to generalize across similar tasks. To this end, we explore pseudocode-style plans (P-code Plan) to capture the structural logic of reasoning. We find that P-code Plan empowers LLM agents with stronger generalization ability and more efficiency. Inspired by this finding, we propose a pseudocode-style Planning Guided Preference Optimization method called PGPO for effective agent learning. With two planning-oriented rewards, PGPO further enhances LLM agents' ability to generate high-quality P-code Plans and subsequent reasoning. Experiments show that PGPO achieves superior performance on representative agent benchmarks and outperforms the current leading baselines. Analyses reveal the advantage of PGPO in reducing action errors and omissions during reasoning.