🤖 AI Summary
Insufficient exploration in reinforcement learning often traps policies in suboptimal trajectories, which limits how far RL can improve the reasoning capabilities of large language models. This work proposes NudgeRL, a framework that uses lightweight, strategy-level contextual guidance to generate diverse reasoning paths and transfers the discovered behaviors back to the base policy through reward decomposition and distillation. Without relying on expensive oracle supervision, NudgeRL drives structured, diversity-oriented exploration. Evaluated on five mathematical reasoning benchmarks, NudgeRL outperforms an oracle-guided RL baseline on average and surpasses standard GRPO even when GRPO is given a rollout budget up to eight times larger.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy-level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To learn effectively from such structured exploration, we further propose a unified objective that decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO even when GRPO is run with rollout budgets up to 8 times larger, and it also outperforms an oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
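To make the idea of splitting a reward signal into inter- and intra-context components more concrete, here is a minimal sketch. It is not the paper's objective, whose exact form the abstract does not give. It assumes a GRPO-style mean-centered reward and shows one natural way such a split could be computed. The function name `decompose_advantages` and the strategy labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def decompose_advantages(rewards, contexts):
    """Split each rollout's mean-centered reward into two parts.

    ASSUMPTION: this illustrates a generic decomposition, not NudgeRL's
    actual objective. For a rollout with reward r sampled under context c:
      inter = mean(rewards under c) - mean(all rewards)
      intra = r - mean(rewards under c)
    so that inter + intra == r - mean(all rewards).
    """
    if len(rewards) != len(contexts):
        raise ValueError("rewards and contexts must have the same length")
    if not rewards:
        raise ValueError("at least one rollout is required")

    global_mean = mean(rewards)

    # Group rewards by the context (strategy) each rollout was conditioned on.
    by_context = defaultdict(list)
    for reward, context in zip(rewards, contexts):
        by_context[context].append(reward)
    context_mean = {c: mean(rs) for c, rs in by_context.items()}

    # Inter-context term: how good a context is relative to the whole group.
    inter = [context_mean[c] - global_mean for c in contexts]
    # Intra-context term: how good a rollout is relative to its own context.
    intra = [r - context_mean[c] for r, c in zip(rewards, contexts)]
    return inter, intra
```

For example, with rewards `[1.0, 0.0, 0.0, 0.0]` and hypothetical contexts `["algebraic", "algebraic", "geometric", "geometric"]`, the inter-context terms are `[0.25, 0.25, -0.25, -0.25]` and the intra-context terms are `[0.5, -0.5, 0.0, 0.0]`. The first set credits the strategy that produced a correct answer, and the second credits the specific rollout within that strategy.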