Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

📅 2025-10-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) rely on chain-of-thought (CoT) reasoning, but their autoregressive, token-level generation lacks global planning capability, leading to redundant, incoherent, or erroneous reasoning. Method: We propose PTA-GRPO, a "Plan-Then-Act" two-stage framework: (1) a planning stage that distills CoT into compact, high-level reasoning guidance; and (2) an execution stage that performs fine-grained CoT reasoning conditioned on that guidance. We further introduce a guidance-aware Group Relative Policy Optimization (GRPO) method that jointly optimizes the quality of the high-level guidance and the accuracy of the final answer. Training combines CoT distillation, supervised fine-tuning, and multi-stage RL optimization. Contribution/Results: PTA-GRPO achieves consistent and significant improvements across multiple LLMs on mathematical reasoning benchmarks, including MATH, AIME 2024/2025, and AMC, demonstrating that explicit, structured planning enhances reasoning coherence and accuracy.
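The page does not reproduce the paper's objective, but the GRPO family of methods it builds on shares a standard group-relative advantage computation: sample a group of completions per prompt, then normalize each completion's reward by the group's mean and standard deviation. A minimal sketch of that normalization (the function name is illustrative, not from the paper):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: each sampled completion's reward is
    centered and scaled by its own group's statistics, so no separate
    value network is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: a group of 4 sampled answers, reward 1.0 if correct else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In PTA-GRPO, per the summary above, the reward additionally reflects the quality of the high-level guidance, not only final-answer correctness.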

📝 Abstract
Large language models (LLMs) have demonstrated remarkable reasoning abilities in complex tasks, often relying on Chain-of-Thought (CoT) reasoning. However, due to their autoregressive token-level generation, the reasoning process is largely constrained to local decision-making and lacks global planning. This limitation frequently results in redundant, incoherent, or inaccurate reasoning, which significantly degrades overall performance. Existing approaches, such as tree-based algorithms and reinforcement learning (RL), attempt to address this issue but suffer from high computational costs and often fail to produce optimal reasoning trajectories. To tackle this challenge, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework designed to improve both high-level planning and fine-grained CoT reasoning. In the first stage, we leverage advanced LLMs to distill CoT into compact high-level guidance, which is then used for supervised fine-tuning (SFT). In the second stage, we introduce a guidance-aware RL method that jointly optimizes the final output and the quality of high-level guidance, thereby enhancing reasoning effectiveness. We conduct extensive experiments on multiple mathematical reasoning benchmarks, including MATH, AIME2024, AIME2025, and AMC, across diverse base models such as Qwen2.5-7B-Instruct, Qwen3-8B, Qwen3-14B, and LLaMA3.2-3B. Experimental results demonstrate that PTA-GRPO consistently achieves stable and significant improvements across different models and tasks, validating its effectiveness and generalization.
Problem

Research questions and friction points this paper is trying to address.

Token-level autoregressive generation confines LLM reasoning to local decisions and lacks global planning
Redundant and incoherent reasoning steps degrade performance on complex tasks
High-level guidance quality is not directly optimized by existing reinforcement learning approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework with planning then action
High-level guidance distillation for supervised fine-tuning
Guidance-aware reinforcement learning for reasoning optimization
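The two-stage "plan then act" flow described above can be sketched at the prompting level. This is a hedged illustration, not the paper's implementation: `generate` is a hypothetical stand-in for an LLM call, and the prompt wording is invented.

```python
def generate(prompt: str) -> str:
    """Stub for an LLM API call; a real system would query a model here.
    Returns a canned response so the sketch is runnable."""
    return "1. Identify knowns. 2. Set up the equation. 3. Solve and check."

def plan_then_act(question: str) -> str:
    # Stage 1 (plan): elicit compact high-level guidance for the problem.
    plan = generate(f"Outline a short high-level plan for: {question}")
    # Stage 2 (act): perform fine-grained CoT reasoning conditioned on the plan.
    answer = generate(
        f"Question: {question}\nPlan: {plan}\nFollow the plan step by step."
    )
    return answer
```

In the paper's training recipe, per the summary, the planning behavior is first instilled by distillation and SFT, then refined jointly with answer accuracy via guidance-aware RL.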
Zhihao Dou, Case Western Reserve University
Qinjian Zhao, Kean University
Zhongwei Wan, The Ohio State University (PhD student; LLM, Multimodal, NLP)
Dinggen Zhang, Kean University
Weida Wang, Fudan University
Towsif Raiyan, Case Western Reserve University
Benteng Chen, The University of Hong Kong
Qingtao Pan, Case Western Reserve University
Yang Ouyang, North Carolina State University
Zhiqiang Gao, Kean University
Shufei Zhang, Shanghai Artificial Intelligence Laboratory
Sumon Biswas, Case Western Reserve University (Assistant Professor; Software Engineering, AI, SE4AI, Programming Languages)