PGPO: Enhancing Agent Reasoning via Pseudocode-style Planning Guided Preference Optimization

📅 2025-06-02

📈 Citations: 0

✨ Influential: 0

career value

174K/year

🤖 AI Summary

Existing LLM-based agents rely on natural-language planning, resulting in verbose reasoning and poor generalization. Method: We propose P-code Planning—a structured, pseudocode-style planning formalism that replaces unstructured text with executable logical constructs to enhance plan interpretability and cross-task transferability. We further introduce Planning-Guided Preference Optimization (PGPO), the first framework to integrate pseudocode planning into LLM reasoning, employing a dual-objective reward function that decouples planning quality from action execution optimization. Our approach combines pseudocode generation, multi-objective reward modeling, preference-based reinforcement learning (a PPO variant), and structured chain-of-thought distillation. Results: On mainstream agent benchmarks, our method significantly outperforms SOTA: action error rate and omission rate decrease by 23.6% and 18.4%, respectively; planning efficiency improves by 37%; and generalization across unseen tasks is substantially enhanced.

Technology Category

Application Category

📝 Abstract

Large Language Model (LLM) agents have demonstrated impressive capabilities in handling complex interactive problems. Existing LLM agents mainly generate natural language plans to guide reasoning, which is verbose and inefficient. NL plans are also tailored to specific tasks and restrict agents' ability to generalize across similar tasks. To this end, we explore pseudocode-style plans (P-code Plan) to capture the structural logic of reasoning. We find that P-code Plan empowers LLM agents with stronger generalization ability and more efficiency. Inspired by this finding, we propose a pseudocode-style Planning Guided Preference Optimization method called PGPO for effective agent learning. With two planning-oriented rewards, PGPO further enhances LLM agents' ability to generate high-quality P-code Plans and subsequent reasoning. Experiments show that PGPO achieves superior performance on representative agent benchmarks and outperforms the current leading baselines. Analyses reveal the advantage of PGPO in reducing action errors and omissions during reasoning.

Problem

Research questions and friction points this paper is trying to address.

Enhancing agent reasoning via pseudocode-style planning

Improving generalization across similar tasks with P-code Plans

Reducing action errors and omissions during reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses pseudocode-style plans for reasoning

Implements planning-guided preference optimization

Enhances generalization with planning-oriented rewards

🔎 Similar Papers

Policy Learning with a Language Bottleneck