EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

📅 2025-10-07

🤖 AI Summary

In reinforcement learning with verifiable rewards (RLVR), large language models (LLMs) frequently over-exploit, which leads to policy entropy collapse and diminished exploration. To address this, we propose a two-stage rollout framework coupled with an adaptive policy forgetting mechanism: the first stage samples half of the trajectories, a lightweight "post-sample forgetting" step then temporarily suppresses those sampled responses, and the second stage samples the remaining trajectories, which disrupts self-reinforcing dominant behavioral patterns; concurrently, policy entropy is dynamically regulated to enable lightweight, differentiable exploration enhancement. Evaluated on five reasoning benchmarks, our method outperforms GRPO, achieving average relative performance gains of 24.3%, 33.0%, and 10.4% on Qwen2.5-3B, Llama3.2-3B-Instruct, and Qwen3-8B-Base, respectively. This work constitutes the first systematic integration of policy forgetting into the LLM-RLVR paradigm, broadening exploration of the output space.

📝 Abstract

Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy collapse, diminished exploratory capacity, and ultimately limited performance gains. Although techniques that increase policy stochasticity can promote exploration, they frequently fail to escape dominant behavioral modes. This creates a self-reinforcing loop (repeatedly sampling and rewarding dominant modes) that further erodes exploration. We introduce Exploration-Enhanced Policy Optimization (EEPO), a framework that promotes exploration via two-stage rollouts with adaptive unlearning. In the first stage, the model generates half of the trajectories; it then undergoes a lightweight unlearning step to temporarily suppress these sampled responses, forcing the second stage to explore different regions of the output space. This sample-then-forget mechanism disrupts the self-reinforcing loop and promotes wider exploration during rollouts. Across five reasoning benchmarks, EEPO outperforms GRPO, achieving average relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
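
The two-stage rollout described above can be illustrated with a toy sketch. This is not the paper's implementation: the "policy" here is a small categorical distribution over a handful of candidate responses, and the unlearning step is a hypothetical stand-in that subtracts a fixed penalty from the logits of responses already sampled in stage 1. The function name sample_then_forget and the penalty parameter are assumptions made for illustration only.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_then_forget(logits, num_rollouts, penalty=5.0, seed=0):
    """Toy two-stage rollout over a categorical 'policy'.

    Stage 1 samples half of the rollouts from the current policy.
    The 'unlearning' step is a hypothetical stand-in: it subtracts a
    fixed penalty from the logits of every response seen in stage 1.
    Stage 2 samples the remaining rollouts from the suppressed copy.
    The caller's logits are never modified, so the suppression is
    temporary.
    """
    rng = random.Random(seed)
    actions = list(range(len(logits)))
    half = num_rollouts // 2

    # Stage 1: sample from the unmodified policy.
    stage1 = rng.choices(actions, weights=softmax(logits), k=half)

    # "Forget": suppress the responses that were just sampled.
    suppressed = list(logits)
    for a in set(stage1):
        suppressed[a] -= penalty

    # Stage 2: sample the rest from the suppressed policy.
    stage2 = rng.choices(
        actions, weights=softmax(suppressed), k=num_rollouts - half
    )
    return stage1, stage2


# One dominant response (index 0) and four rarer alternatives.
logits = [4.0, 1.0, 1.0, 1.0, 1.0]
stage1, stage2 = sample_then_forget(logits, num_rollouts=16)
print("stage 1:", stage1)
print("stage 2:", stage2)
```

In this toy setting, stage 1 tends to be dominated by the high-logit response, while stage 2 draws more often from the alternatives because the dominant response has been pushed down. In an actual LLM setting the suppression would act on model parameters or sequence likelihoods rather than on a logit table.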

Problem

Research questions and friction points this paper is trying to address.

Balancing exploration and exploitation in RLVR for LLMs

Addressing entropy collapse and limited exploratory capacity

Disrupting self-reinforcing loops of dominant behavioral modes
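
As a rough illustration of what entropy collapse means in the items above, the sketch below computes the Shannon entropy of two toy response distributions. The distributions are made-up values, not measurements from the paper; the point is only that a policy concentrated on one dominant mode has much lower entropy than one that spreads probability across alternatives.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical distributions over four candidate responses.
uniform = [0.25, 0.25, 0.25, 0.25]   # exploratory policy
peaked = [0.97, 0.01, 0.01, 0.01]    # collapsed onto one dominant mode

print("uniform entropy:", entropy(uniform))
print("peaked entropy:", entropy(peaked))
```

The uniform case reaches the maximum entropy for four outcomes, log(4) nats, while the peaked case is close to zero, which is the regime where repeated sampling keeps returning the same dominant response.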

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage rollouts with adaptive unlearning

Sample-then-forget mechanism disrupts the self-reinforcing loop

Lightweight unlearning suppresses sampled responses temporarily
