🤖 AI Summary
Group Relative Policy Optimization (GRPO) for reinforcement learning–based fine-tuning of large language models (LLMs) incurs prohibitively high computational costs. Method: This work introduces the first predictive scaling-law framework tailored to GRPO training of large reasoning models, modeling training dynamics as a function of model size, initial performance, and training progress. It empirically identifies a universal three-phase evolution ("slow start," "rapid improvement," and "plateau") and fits reward trajectories across Llama and Qwen models (3B/8B) to validate cross-model generalizability. Contribution/Results: A key finding is that reward gains asymptotically vanish after one epoch, enabling early stopping without performance degradation. The framework provides quantifiable, generalizable termination criteria for efficient LLM reasoning fine-tuning, substantially reducing computational overhead while preserving reasoning capability.
📝 Abstract
Fine-tuning large language models (LLMs) for reasoning tasks with reinforcement learning methods such as Group Relative Policy Optimization (GRPO) is computationally expensive. To address this, we propose a predictive framework that models training dynamics and helps optimize resource usage. Through experiments on Llama and Qwen models (3B/8B), we derive an empirical scaling law based on model size, initial performance, and training progress. This law predicts reward trajectories and identifies three consistent training phases: slow start, rapid improvement, and plateau. We find that training beyond one epoch offers little additional gain, suggesting that earlier stopping can significantly reduce compute without sacrificing performance. Our approach generalizes across model types, providing a practical guide for efficient GRPO-based fine-tuning.
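The abstract does not state the paper's exact functional form, but the described three-phase trajectory (slow start, rapid improvement, plateau) is exactly the shape of a logistic curve. As a minimal sketch, assuming a logistic reward model with hypothetical parameters `r0` (initial reward), `r_max` (plateau reward), `k` (steepness), and `t0` (midpoint in epochs), one could fit observed reward trajectories and derive an early-stopping check like so:

```python
import numpy as np
from scipy.optimize import curve_fit

def reward_curve(t, r0, r_max, k, t0):
    """Hypothetical scaling-law form: logistic reward vs. training progress t
    (in epochs). Produces the three phases described in the paper:
    slow start (t << t0), rapid improvement (t ~ t0), plateau (t >> t0)."""
    return r0 + (r_max - r0) / (1.0 + np.exp(-k * (t - t0)))

# Synthetic reward trajectory for illustration only (not the paper's data).
t = np.linspace(0.0, 1.5, 60)
rng = np.random.default_rng(0)
obs = reward_curve(t, 0.2, 0.8, 12.0, 0.4) + rng.normal(0.0, 0.01, t.shape)

# Fit the curve to the observed trajectory.
params, _ = curve_fit(reward_curve, t, obs, p0=[0.1, 0.9, 5.0, 0.5])
r0_fit, r_max_fit, k_fit, t0_fit = params

# Early-stopping criterion: the predicted marginal reward gain per 0.01 epoch
# has effectively vanished by the one-epoch mark (the plateau phase).
gain_at_one_epoch = reward_curve(1.01, *params) - reward_curve(1.00, *params)
print(f"fitted plateau reward: {r_max_fit:.3f}")
print(f"marginal gain at 1 epoch: {gain_at_one_epoch:.2e}")
```

Under these assumed parameters, the fitted curve recovers the plateau level and shows a negligible marginal gain near one epoch, mirroring the paper's finding that training past one epoch adds little.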