ESPO: Early-Stopping Proximal Policy Optimization

📅 2026-05-28
🤖 AI Summary
This work addresses a problem in reinforcement learning for large language models: an early reasoning error often dooms the rest of a rollout, wasting tokens and corrupting advantage estimates. The authors propose ESPO, which adds an online failure-detection mechanism to the PPO framework using a smoothed cumulative regret computed from logits already available during sampling. The mechanism identifies failing trajectories and terminates them early, treating them as absorbing failure states with a terminal reward, which concentrates negative temporal-difference errors near the detected failure step. ESPO requires no additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B it reaches 46.28% on AIME 2024, 85.83% on AMC 2023, and 87.42% on MATH-500, each above standard PPO, while reducing cumulative rollout token consumption by more than 20%.
📝 Abstract
When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receive positive reward and polluting advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on-the-fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, concentrating negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while saving more than 20% of rollout tokens cumulatively.
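The mechanism described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact regret definition, smoothing rule, or stopping threshold, so everything below is an assumption made for illustration: the per-step surrogate regret is taken to be the log-probability gap between the most likely token and the sampled token, the smoothing is a discounted running sum with a hypothetical factor `beta`, and the stopping rule is a fixed hypothetical `threshold`. The function names are likewise hypothetical and not from the paper.

```python
import math


def surrogate_regret(logits, sampled_idx):
    """Assumed per-step regret: log-prob gap between the most likely
    token and the token that was actually sampled (0 if greedy)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    return max(log_probs) - log_probs[sampled_idx]


def detect_failure_step(step_logits, sampled, beta=0.9, threshold=3.0):
    """Return the first step whose smoothed cumulative regret exceeds
    the threshold, or None if the rollout is never stopped.
    `beta` and `threshold` are illustrative, not values from the paper."""
    smoothed = 0.0
    for t, (logits, idx) in enumerate(zip(step_logits, sampled)):
        # Discounted running sum: one assumed form of "smoothed cumulative".
        smoothed = beta * smoothed + surrogate_regret(logits, idx)
        if smoothed > threshold:
            return t
    return None


def td_errors(values, rewards, gamma=1.0):
    """One-step TD errors, delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    where the state after the last step is absorbing with value 0."""
    deltas = []
    for t, r in enumerate(rewards):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        deltas.append(r + gamma * next_v - values[t])
    return deltas


# A rollout truncated at step 2 with an assumed terminal reward of -1.
values = [0.5, 0.4, 0.3]      # critic estimates V(s_0..s_2), made up
rewards = [0.0, 0.0, -1.0]    # terminal reward placed at the stop step
deltas = td_errors(values, rewards)
```

In this toy rollout the TD error at the truncation step is much more negative than at the earlier steps, which is the sense in which a terminal reward at the stopping point concentrates negative TD error near the detected failure. In practice the logits would come from the sampling forward pass, so the detector adds no extra model calls.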
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
large language models
trajectory failure
early stopping
reward estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early-Stopping
Proximal Policy Optimization
Surrogate Regret
Trajectory Truncation
Temporal-Difference Error