Learn Hard Problems During RL with Reference Guided Fine-tuning

📅 2026-03-01
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of sparse rewards in reinforcement learning (RL) for mathematical reasoning, which often impedes the generation of effective solution trajectories. To mitigate this issue, the authors propose Reference-Guided Fine-Tuning (ReGFT), a method that, prior to RL, synthesizes high-quality positive trajectories by integrating human reference solutions with model-generated reasoning paths. These trajectories are designed to be both instructionally aligned and compatible with the model’s intrinsic capabilities, and are leveraged through supervised fine-tuning. ReGFT effectively alleviates reward sparsity, substantially increasing the number of solvable problems and enhancing subsequent RL efficiency. Evaluated on the AIME24, AIME25, and BeyondAIME benchmarks, ReGFT not only significantly improves supervised accuracy but also accelerates training convergence and surpasses previous performance ceilings.

πŸ“ Abstract
Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: for challenging problems, LLM fails to sample any correct trajectories, preventing RL from receiving meaningful positive feedback. At the same time, there often exist human-written reference solutions along with the problem (e.g., problems from AoPS), but directly fine-tuning on these solutions offers no benefit because models often cannot imitate human proofs that lie outside their own reasoning distribution. We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that utilizes human-written reference solutions to synthesize positive trajectories on hard problems and train on them before RL. For each problem, we provide the model with a partial reference solution and let it generate its own reasoning trace, ensuring the resulting trajectories remain in the model's reasoning space while still benefiting from reference guidance. Fine-tuning on these reference-guided trajectories increases the number of solvable problems and produces a checkpoint that receives more positive rewards during RL. Across three benchmarks (AIME24, AIME25, BeyondAIME), ReGFT consistently improves supervised accuracy, accelerates DAPO training, and raises the final performance plateau of RL. Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.
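The synthesis loop the abstract describes (prefix the model with part of a human reference solution, let it finish in its own style, and keep only correct completions for SFT) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `generate` and `is_correct` are hypothetical stand-ins for the model call and the answer checker, and the prefix fractions are an assumption.

```python
# Sketch of ReGFT-style trajectory synthesis (illustrative only).
# Assumptions: `generate(prompt)` is a stand-in for an LLM sampling call,
# `is_correct(trajectory, answer)` for an answer-matching check.

def reference_prefix(reference_solution: str, fraction: float) -> str:
    """Take the first `fraction` of the reference solution's lines as a hint."""
    steps = reference_solution.split("\n")
    k = max(1, int(len(steps) * fraction))
    return "\n".join(steps[:k])

def synthesize_trajectories(problems, generate, is_correct,
                            fractions=(0.25, 0.5, 0.75)):
    """For each hard problem, prompt the model with a growing reference-solution
    prefix until it completes a correct trajectory in its own reasoning style."""
    sft_data = []
    for prob in problems:
        for frac in fractions:
            hint = reference_prefix(prob["reference"], frac)
            prompt = (f"{prob['question']}\n\nPartial solution:\n{hint}\n\n"
                      f"Continue the solution:")
            trajectory = generate(prompt)
            if is_correct(trajectory, prob["answer"]):
                # Keep the model-generated trace, not the human reference text,
                # so the SFT data stays inside the model's reasoning distribution.
                sft_data.append({"prompt": prob["question"],
                                 "response": trajectory})
                break  # smallest sufficient hint; stop once the problem is solved
    return sft_data
```

Starting from the smallest prefix mirrors the paper's stated goal: the human text serves only as guidance, while the trained-on tokens remain the model's own.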
Problem

Research questions and friction points this paper is trying to address.

reward sparsity
mathematical reasoning
reinforcement learning
reference solutions
trajectory generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reference-Guided Fine-Tuning
reward sparsity
mathematical reasoning
reinforcement learning
trajectory synthesis