🤖 AI Summary
This work addresses two challenges in reinforcement learning for LLM reasoning and agent tasks: inefficient exploration and unstable training, both stemming from trajectory-level sparse rewards that penalize effective prefixes and let failed trajectories dominate updates. To overcome these issues, the authors propose a Reflect-then-Retry mechanism that leverages natural-language feedback to diagnose errors and restart rollouts from the identified failure points, actively synthesizing high-quality trajectories at lower cost. They introduce Pivotal Credit Assignment, which restricts policy updates to the erroneous suffixes of trajectories, improving learning efficiency. Additionally, Positive Amplification upweights sparse success signals to stabilize off-policy training. Evaluated across multiple reasoning and agent tasks, the method achieves relative gains of 5%–52% over baselines while improving both exploration efficiency and training stability.
📝 Abstract
Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and the high cost of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: trajectory-level rewards penalize valid prefixes for later errors, and failure-dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To this end, we propose R$^3$L, Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high-quality trajectories, R$^3$L shifts from stochastic sampling to active synthesis via reflect-then-retry, leveraging language feedback to diagnose errors, transform failed attempts into successful ones, and reduce rollout costs by restarting from identified failure points. With errors diagnosed and localized, Pivotal Credit Assignment updates only the diverging suffix where contrastive signals exist, excluding the shared prefix from gradient updates. Since failures dominate on difficult tasks and reflect-then-retry produces off-policy data, risking training instability, Positive Amplification upweights successful trajectories so that positive signals guide optimization. Experiments on agentic and reasoning tasks demonstrate 5\% to 52\% relative improvements over baselines while maintaining training stability. Our code is released at https://github.com/shiweijiezero/R3L.
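The two exploitation-side mechanisms described above, suffix-only credit assignment and positive upweighting, can be sketched as a toy REINFORCE-style loss. This is a minimal illustration, not the paper's implementation; the function name, the default `pos_weight`, and the scalar-advantage simplification are all assumptions for exposition.

```python
def r3l_style_loss(logprobs, advantage, prefix_len, pos_weight=2.0):
    """Toy loss combining Pivotal Credit Assignment and Positive Amplification.

    logprobs: per-token log-probabilities of a retried trajectory.
    advantage: scalar trajectory-level advantage (positive = success).
    prefix_len: length of the prefix shared with the original failed attempt;
        these tokens carry no contrastive signal and are excluded.
    pos_weight: illustrative upweighting factor for successful trajectories.
    """
    # Pivotal Credit Assignment: only the diverging suffix gets gradient.
    suffix = logprobs[prefix_len:]
    # Positive Amplification: keep sparse successes from being drowned
    # out by failure-dominated batches.
    w = pos_weight if advantage > 0 else 1.0
    # Negative advantage-weighted mean log-likelihood (loss to minimize).
    return -w * advantage * sum(suffix) / len(suffix)
```

With `prefix_len=2` the first two tokens contribute nothing to the loss, and a successful trajectory (`advantage > 0`) is weighted twice as heavily as a failed one under this illustrative default.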