Reinforcement Learning for Reasoning in Large Language Models with One Training Example

📅 2025-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the bottleneck of requiring large-scale annotated data to enhance the mathematical reasoning capabilities of large language models, this paper proposes one-shot reinforcement learning with verifiable reward (1-shot RLVR), which optimizes a model using only a single math reasoning example paired with a verification procedure. Built on the GRPO/PPO framework, the method combines verifiable reward modeling, entropy-regularized exploration incentives, and policy gradient optimization. Empirically, 1-shot RLVR matches the performance of RLVR trained on the full 1.2k-example DeepScaleR subset that contains the chosen example. The authors further uncover novel phenomena (post-saturation generalization, cross-domain transfer, and increased self-reflection) and show that the effect arises primarily from the policy gradient loss, distinguishing it from grokking. On Qwen2.5-Math-1.5B, MATH500 accuracy improves from 36.0% to 73.6%, and the average score across six benchmarks rises from 17.6% to 35.7%; notably, applying entropy loss alone, without any outcome reward, yields a 27.4-point absolute gain on MATH500.
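The summary's three ingredients (verifiable reward, GRPO-style policy gradient, entropy bonus) can be sketched as follows. This is a minimal illustration under assumed simplifications, not the authors' implementation: the reward is reduced to exact-match answer checking, the group size and entropy coefficient are hypothetical values, and all function names are invented for this sketch.

```python
import math

def verifiable_reward(answer: str, gold: str) -> float:
    # Binary outcome reward: 1.0 if the model's extracted answer matches
    # the gold answer, else 0.0 (a simplified stand-in for a math verifier).
    return 1.0 if answer.strip() == gold.strip() else 0.0

def grpo_advantages(rewards):
    # GRPO computes advantages by normalizing rewards within a group of
    # rollouts sampled for the same prompt: A_i = (r_i - mean) / (std + eps).
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    eps = 1e-6  # avoids division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]

def loss_terms(logprobs, advantages, entropies, entropy_coef=0.001):
    # Objective to minimize: policy-gradient term plus an entropy bonus
    # that rewards exploration (coefficient value is hypothetical).
    pg_loss = -sum(lp * a for lp, a in zip(logprobs, advantages))
    entropy_bonus = -entropy_coef * sum(entropies)
    return pg_loss + entropy_bonus
```

With a group of four rollouts and binary rewards `[1, 0, 1, 0]`, the normalized advantages are approximately `[+1, -1, +1, -1]`, so gradient updates push probability mass toward verified-correct completions.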

📝 Abstract
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (many of which yield approximately 30% or greater improvement on MATH500 when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by adding entropy loss with an appropriate coefficient) in 1-shot RLVR training. As a bonus, we observe that applying entropy loss alone, without any outcome reward, significantly enhances Qwen2.5-Math-1.5B's performance on MATH500 by 27.4%. These findings can inspire future work on RLVR data efficiency and encourage a re-examination of both recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR
Problem

Research questions and friction points this paper is trying to address.

Enhancing math reasoning in LLMs with one training example
Improving performance across multiple math benchmarks efficiently
Exploring reinforcement learning mechanisms for better generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

1-shot RLVR enhances math reasoning in LLMs
Policy gradient loss drives 1-shot RLVR effectiveness
Entropy loss boosts exploration in 1-shot RLVR
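The last point above, the entropy-only training signal reported as a bonus finding, can be sketched as follows. This is a hypothetical illustration, not the paper's code: the loss simply maximizes the mean Shannon entropy of the policy's next-token distributions, with no outcome reward, and the coefficient value is an assumption.

```python
import math

def token_entropy(probs):
    # Shannon entropy of a single next-token probability distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_only_loss(distributions, coef=0.001):
    # Entropy-only objective: minimizing this loss maximizes the mean
    # per-token entropy, encouraging exploration with no reward signal.
    # (The paper reports this alone lifts Qwen2.5-Math-1.5B on MATH500
    # by 27.4 points; the coefficient here is a hypothetical value.)
    mean_entropy = sum(token_entropy(d) for d in distributions) / len(distributions)
    return -coef * mean_entropy
```

Because the loss is the negative of the entropy, gradient descent on it pushes the policy toward more diverse sampling, the exploration effect the bullet refers to.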