Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

📅 2025-10-05
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the high gradient noise and unstable policy updates during early-stage reinforcement learning (RL) fine-tuning of large language models (LLMs), this paper proposes a three-stage policy optimization framework. First, a repositioning mechanism is introduced to correct policy drift before each update, decoupling slow, precise correction from rapid exploration. Second, the framework integrates phased slow–fast policy updates, group-relative policy optimization (GRPO)-based gradient computation, and intra-batch cyclic trajectory generation. Crucially, the original optimization objective remains unchanged. The framework significantly improves training stability and efficiency. On mathematical reasoning benchmarks, it achieves an average performance gain of +2.80 points over GRPO, while reducing sampling requirements and training time by up to 4.93× and 4.19×, respectively.

πŸ“ Abstract
Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework to address these limitations via decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design preserves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average on math reasoning benchmarks. It also achieves up to 4.93× fewer rollouts and a 4.19× reduction in wall-clock time to match GRPO's best accuracy.
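The three-stage step described in the abstract can be sketched in toy form. This is a minimal illustration, not the paper's implementation: the reposition rule (linear interpolation back toward the pre-update anchor with a hypothetical coefficient `alpha`), the learning rates, and the number of fast inner steps are all assumptions chosen for clarity; in SFPO the gradient would come from the GRPO objective on a batch of rollouts, here replaced by a stand-in quadratic loss.

```python
import numpy as np

def sfpo_step(theta, grad_fn, fast_steps=3, fast_lr=0.1, alpha=0.5, slow_lr=0.05):
    """One SFPO-style update: fast trajectory -> reposition -> slow correction.

    All hyperparameters here are illustrative assumptions, not the paper's values.
    """
    theta_0 = theta.copy()  # anchor: parameters before the fast trajectory

    # Stage 1: short fast trajectory of inner gradient steps on the same batch
    for _ in range(fast_steps):
        theta = theta - fast_lr * grad_fn(theta)

    # Stage 2: reposition toward the anchor to limit off-policy drift
    # (assumed linear interpolation; alpha=1 would keep the fast result)
    theta = theta_0 + alpha * (theta - theta_0)

    # Stage 3: final slow correction from the repositioned point
    theta = theta - slow_lr * grad_fn(theta)
    return theta

# Toy demo: minimize ||theta||^2 as a stand-in for the policy loss
grad = lambda t: 2.0 * t
theta = np.array([1.0, -2.0])
for _ in range(50):
    theta = sfpo_step(theta, grad)
```

Note the design point the abstract emphasizes: the objective (`grad_fn`) and the rollout process are untouched; only the update schedule around them changes, which is why the method is plug-compatible with existing policy-gradient pipelines.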
Problem

Research questions and friction points this paper is trying to address.

Addresses unstable updates in RL for LLM reasoning
Reduces inefficient exploration during early training phases
Minimizes rollout requirements while accelerating convergence speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Slow-Fast Policy Optimization with reposition-before-update design
Decomposes steps into fast trajectory and slow correction
Reduces rollouts and accelerates convergence in reasoning RL training