Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

📅 2025-10-05
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the high gradient noise and unstable policy updates during early-stage reinforcement learning (RL) fine-tuning of large language models (LLMs), this paper proposes a three-stage policy optimization framework. First, a repositioning mechanism is introduced to correct policy drift before each update, decoupling slow, precise correction from rapid exploration. Second, the framework integrates phased slow–fast policy updates, group-relative policy optimization (GRPO)-based gradient computation, and intra-batch cyclic trajectory generation. Crucially, the original optimization objective remains unchanged. The framework significantly improves training stability and efficiency. On mathematical reasoning benchmarks, it achieves an average performance gain of +2.80 points over GRPO, while reducing sampling requirements and training time by up to 4.93× and 4.19×, respectively.

πŸ“ Abstract
Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework to address these limitations via decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design preserves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average on math reasoning benchmarks. It also achieves up to 4.93× fewer rollouts and a 4.19× reduction in wall-clock time to match GRPO's best accuracy.
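The three-stage step described in the abstract can be sketched in toy form. This is a minimal illustration, not the paper's implementation: the reposition rule (linear interpolation back toward the pre-update anchor with a hypothetical coefficient `alpha`), the learning rates, and the number of fast inner steps are all assumptions chosen for clarity; in SFPO the gradient would come from the GRPO objective on a batch of rollouts, here replaced by a stand-in quadratic loss.

```python
import numpy as np

def sfpo_step(theta, grad_fn, fast_steps=3, fast_lr=0.1, alpha=0.5, slow_lr=0.05):
    """One SFPO-style update: fast trajectory -> reposition -> slow correction.

    All hyperparameters here are illustrative assumptions, not the paper's values.
    """
    theta_0 = theta.copy()  # anchor: parameters before the fast trajectory

    # Stage 1: short fast trajectory of inner gradient steps on the same batch
    for _ in range(fast_steps):
        theta = theta - fast_lr * grad_fn(theta)

    # Stage 2: reposition toward the anchor to limit off-policy drift
    # (assumed linear interpolation; alpha=1 would keep the fast result)
    theta = theta_0 + alpha * (theta - theta_0)

    # Stage 3: final slow correction from the repositioned point
    theta = theta - slow_lr * grad_fn(theta)
    return theta

# Toy demo: minimize ||theta||^2 as a stand-in for the policy loss
grad = lambda t: 2.0 * t
theta = np.array([1.0, -2.0])
for _ in range(50):
    theta = sfpo_step(theta, grad)
```

Note the design point the abstract emphasizes: the objective (`grad_fn`) and the rollout process are untouched; only the update schedule around them changes, which is why the method is plug-compatible with existing policy-gradient pipelines.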
Problem

Research questions and friction points this paper is trying to address.

Addresses unstable updates in RL for LLM reasoning
Reduces inefficient exploration during early training phases
Minimizes rollout requirements while accelerating convergence speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Slow-Fast Policy Optimization with reposition-before-update design
Decomposes steps into fast trajectory and slow correction
Reduces rollouts and accelerates convergence in reasoning RL training