Self-Hinting Language Models Enhance Reinforcement Learning

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of training collapse in Group Relative Policy Optimization (GRPO) under sparse terminal rewards, where trajectories within a group share identical rewards, leading to vanishing relative advantages and stalled learning. To mitigate this, the authors propose SAGE, a framework that leverages self-generated compact hints (such as plans or subgoals) as privileged supervisory signals during training, increasing intra-group outcome diversity and preventing GRPO updates from vanishing. Notably, SAGE requires no hints at test time, enabling deployment without privileged information. Built on on-policy reinforcement learning, SAGE combines a dynamic self-hinting mechanism, hint-conditioned generation, and GRPO optimization to deliver an adaptive curriculum aligned with the learner's evolving capabilities. Evaluations across six benchmarks with three large language models demonstrate consistent gains over GRPO: on average +2.0 for Llama-3.2-3B-Instruct, +1.2 for Qwen2.5-7B-Instruct, and +1.3 for Qwen3-4B-Instruct.
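The collapse described above can be seen in a few lines. Below is a minimal sketch of the group-relative advantage used by GRPO, assuming the standard mean/std normalization within a rollout group; the `eps` stabilizer is an illustrative detail, not taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward against
    its group's mean and (population) standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Sparse terminal reward: every rollout in the group fails (reward 0),
# so all relative advantages vanish and the policy update is zero.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # → [0.0, 0.0, 0.0, 0.0]

# Mixed outcomes within the group restore a nonzero learning signal,
# which is what diversifying rollouts with self-hints aims to achieve.
print(group_relative_advantages([0.0, 1.0, 0.0, 1.0]))
```

The same vanishing occurs when every rollout succeeds: any reward vector that is constant within the group normalizes to all-zero advantages.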

📝 Abstract
Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt $x$, the model samples a compact hint $h$ (e.g., a plan or decomposition) and then generates a solution $\tau$ conditioned on $(x,h)$. Crucially, the task reward $R(x,\tau)$ is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set $h=\varnothing$ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
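The rollout scheme the abstract describes can be sketched as follows. All function names (`sample_hint`, `generate`, `verifier_reward`, `sage_rollout_group`) are hypothetical stand-ins for illustration, not identifiers from the paper's released code; the key property shown is that hints condition generation but never enter the reward.

```python
import random

def sample_hint(policy, x):
    """Sample a compact self-hint h (e.g., a plan or decomposition)
    for prompt x. Stubbed here with random choices."""
    return random.choice(["plan: decompose", "plan: direct", None])

def generate(policy, x, h):
    """Generate a solution tau conditioned on (x, h); h=None means
    the no-hint policy used at test time."""
    return f"solution({x}, hint={h})"

def verifier_reward(x, tau):
    """Terminal verifier reward R(x, tau). It depends only on the
    prompt and solution, so hints leave the objective unchanged."""
    return 1.0 if "decompose" in tau else 0.0

def sage_rollout_group(policy, x, group_size=4):
    """One training-time rollout group: each rollout conditions on its
    own sampled hint, diversifying outcomes under the same reward."""
    group = []
    for _ in range(group_size):
        h = sample_hint(policy, x)   # privileged, training-time only
        tau = generate(policy, x, h)
        group.append((h, tau, verifier_reward(x, tau)))
    return group

# At test time, deploy with h=None: no privileged information is needed.
```

Because different hints steer rollouts toward different outcomes, the group's rewards are less likely to be identical under finite sampling, which is exactly the condition GRPO needs for nonzero advantages.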
Problem

Research questions and friction points this paper is trying to address.

sparse rewards
relative policy optimization
reward collapse
large language models
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-hinting
reinforcement learning
GRPO
privileged supervision
sparse rewards