Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

📅 2025-10-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) rely on chain-of-thought (CoT) reasoning, but their autoregressive, token-level generation lacks global planning capability, leading to redundant, incoherent, or erroneous reasoning. Method: We propose PTA-GRPO, a "Plan-Then-Act" two-stage framework: (1) a planning stage that distills CoT into compact, high-level reasoning guidance; and (2) an execution stage that performs fine-grained CoT reasoning conditioned on that guidance. We further introduce a guidance-aware Group Relative Policy Optimization (GRPO) method that jointly optimizes the quality of the high-level guidance and the accuracy of the final answer. Training combines CoT distillation, supervised fine-tuning, and multi-stage RL optimization. Contribution/Results: PTA-GRPO achieves consistent and significant improvements across multiple LLMs on mathematical reasoning benchmarks, including MATH, AIME 2024/2025, and AMC, demonstrating that explicit, structured planning enhances reasoning coherence and accuracy.
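The page does not reproduce the paper's objective, but the GRPO family of methods it builds on shares a standard group-relative advantage computation: sample a group of completions per prompt, then normalize each completion's reward by the group's mean and standard deviation. A minimal sketch of that normalization (the function name is illustrative, not from the paper):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: each sampled completion's reward is
    centered and scaled by its own group's statistics, so no separate
    value network is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: a group of 4 sampled answers, reward 1.0 if correct else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In PTA-GRPO, per the summary above, the reward additionally reflects the quality of the high-level guidance, not only final-answer correctness.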

📝 Abstract
Large language models (LLMs) have demonstrated remarkable reasoning abilities in complex tasks, often relying on Chain-of-Thought (CoT) reasoning. However, due to their autoregressive token-level generation, the reasoning process is largely constrained to local decision-making and lacks global planning. This limitation frequently results in redundant, incoherent, or inaccurate reasoning, which significantly degrades overall performance. Existing approaches, such as tree-based algorithms and reinforcement learning (RL), attempt to address this issue but suffer from high computational costs and often fail to produce optimal reasoning trajectories. To tackle this challenge, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework designed to improve both high-level planning and fine-grained CoT reasoning. In the first stage, we leverage advanced LLMs to distill CoT into compact high-level guidance, which is then used for supervised fine-tuning (SFT). In the second stage, we introduce a guidance-aware RL method that jointly optimizes the final output and the quality of high-level guidance, thereby enhancing reasoning effectiveness. We conduct extensive experiments on multiple mathematical reasoning benchmarks, including MATH, AIME2024, AIME2025, and AMC, across diverse base models such as Qwen2.5-7B-Instruct, Qwen3-8B, Qwen3-14B, and LLaMA3.2-3B. Experimental results demonstrate that PTA-GRPO consistently achieves stable and significant improvements across different models and tasks, validating its effectiveness and generalization.
Problem

Research questions and friction points this paper is trying to address.

Token-level autoregressive generation confines LLM reasoning to local decisions and lacks global planning
Redundant and incoherent reasoning steps degrade performance on complex tasks
High-level guidance quality is not directly optimized by existing reinforcement learning approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework with planning then action
High-level guidance distillation for supervised fine-tuning
Guidance-aware reinforcement learning for reasoning optimization
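The two-stage "plan then act" flow described above can be sketched at the prompting level. This is a hedged illustration, not the paper's implementation: `generate` is a hypothetical stand-in for an LLM call, and the prompt wording is invented.

```python
def generate(prompt: str) -> str:
    """Stub for an LLM API call; a real system would query a model here.
    Returns a canned response so the sketch is runnable."""
    return "1. Identify knowns. 2. Set up the equation. 3. Solve and check."

def plan_then_act(question: str) -> str:
    # Stage 1 (plan): elicit compact high-level guidance for the problem.
    plan = generate(f"Outline a short high-level plan for: {question}")
    # Stage 2 (act): perform fine-grained CoT reasoning conditioned on the plan.
    answer = generate(
        f"Question: {question}\nPlan: {plan}\nFollow the plan step by step."
    )
    return answer
```

In the paper's training recipe, per the summary, the planning behavior is first instilled by distillation and SFT, then refined jointly with answer accuracy via guidance-aware RL.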
Zhihao Dou, Case Western Reserve University
Qinjian Zhao, Kean University
Zhongwei Wan, The Ohio State University (PhD student; LLM, Multimodal, NLP)
Dinggen Zhang, Kean University
Weida Wang, Fudan University
Towsif Raiyan, Case Western Reserve University
Benteng Chen, The University of Hong Kong
Qingtao Pan, Case Western Reserve University
Yang Ouyang, North Carolina State University
Zhiqiang Gao, Kean University
Shufei Zhang, Shanghai Artificial Intelligence Laboratory
Sumon Biswas, Case Western Reserve University (Assistant Professor; Software Engineering, AI, SE4AI, Programming Languages)