🤖 AI Summary
To address the exploration-exploitation imbalance and entropy collapse that arise when training large language models (LLMs) with reinforcement learning on long-horizon, sparse-reward agent tasks, this paper proposes SPEAR, a progressive self-imitation learning framework. SPEAR introduces a curriculum-based self-imitation mechanism that jointly leverages intrinsic rewards and trajectory-level entropy regularization to dynamically balance skill-level and action-level exploration during policy evolution, thereby avoiding the distribution shift and training instability induced by naive entropy maximization. Key technical components include a prioritized replay buffer, advantage recalibration, and truncation of tokens with high covariance between action probability and advantage. Experiments demonstrate that SPEAR significantly improves success rates and sample efficiency on complex tool-use tasks, yields more robust policy convergence, and effectively mitigates both entropy collapse and policy divergence.
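The prioritized replay buffer and advantage recalibration mentioned above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the class names, the return-ordered priority, and the baseline-subtraction form of recalibration are all assumptions made for the sketch.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(order=True)
class Trajectory:
    # Episodic return doubles as the priority key; with a min-heap,
    # the root is always the worst trajectory currently kept.
    ret: float
    tokens: List[int] = field(compare=False)
    old_logprobs: List[float] = field(compare=False)
    advantage: float = field(compare=False)  # advantage at storage time

class PrioritizedReplayBuffer:
    """Keeps the top-`capacity` self-generated trajectories by return."""

    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.heap: List[Trajectory] = []

    def add(self, traj: Trajectory) -> None:
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, traj)
        else:
            # Push the new trajectory and evict the lowest-return one.
            heapq.heappushpop(self.heap, traj)

    def sample_recalibrated(self, value_baseline: float) -> List[Tuple[Trajectory, float]]:
        # One plausible drift correction (an assumption, not the paper's
        # exact scheme): replace the stale stored advantage with the
        # return minus the *current* policy's value baseline.
        return [(t, t.ret - value_baseline) for t in self.heap]
```

The min-heap keeps insertion and eviction at O(log capacity), which matters when the buffer is refreshed every rollout batch.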
📝 Abstract
Reinforcement learning (RL) is the dominant paradigm for sharpening the strategic tool-use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces the fundamental challenge of the exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability due to multi-turn distribution shift. In this paper, we target a progressive exploration-exploitation balance guided by the agent's own experiences, without succumbing to either entropy collapse or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, in which a replay buffer stores self-generated promising trajectories for off-policy updates, by gradually steering the policy evolution within a well-balanced entropy range across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration and facilitating action-level exploration through SIL. Early in training, an auxiliary tool-call reward plays a critical role in the accumulation of tool-use skills, enabling broad exposure to the unfamiliar distributions of environment feedback with an upward entropy trend. As training progresses, self-imitation is strengthened to exploit successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to counter potential policy drift. Regularizations, such as clipping tokens with high covariance between probability and advantage, are introduced for trajectory-level entropy control to curb over-confidence.
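The covariance-based clipping in the abstract's final sentence can be illustrated with a small sketch. Hedged: the paper's exact statistic and threshold are not given here; this version simply masks out of the policy-gradient update those tokens whose per-token contribution to Cov(log π, A) is an outlier, with the "k standard deviations above the mean" cutoff being an assumption for illustration.

```python
import numpy as np

def covariance_keep_mask(logprobs: np.ndarray,
                         advantages: np.ndarray,
                         k: float = 1.0) -> np.ndarray:
    """Return a boolean mask of tokens to KEEP in the policy update.

    Each token's product of centered log-probability and centered
    advantage is its contribution to Cov(log pi, A) over the batch.
    Tokens whose contribution exceeds mean + k * std (i.e. tokens the
    policy is already confident about AND that carry a large advantage)
    are dropped, curbing over-confidence and entropy collapse.
    """
    lp = logprobs - logprobs.mean()
    adv = advantages - advantages.mean()
    contrib = lp * adv                      # per-token covariance term
    thresh = contrib.mean() + k * contrib.std()
    return contrib <= thresh
```

A usage pattern would be to multiply the per-token loss by this mask before reduction, so high-covariance tokens contribute no gradient while still appearing in the rollout.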