Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Group Relative Policy Optimization (GRPO) for reinforcement learning–based fine-tuning of large language models (LLMs) incurs prohibitively high computational costs. Method: This work introduces the first predictive scaling law framework tailored to GRPO training of large reasoning models, modeling training dynamics as a function of model size, initial performance, and training progress. It empirically identifies a universal three-phase evolution—"slow start," "rapid improvement," and "plateau"—and fits reward trajectories across Llama and Qwen models (3B/8B) to validate cross-model generalizability. Contribution/Results: A key finding is that reward gain asymptotically vanishes after one epoch, enabling early stopping without performance degradation. The framework provides quantifiable, generalizable termination criteria for efficient LLM reasoning fine-tuning, substantially reducing computational overhead while preserving inference capability.

📝 Abstract
Fine-tuning large language models (LLMs) for reasoning tasks using reinforcement learning methods like Group Relative Policy Optimization (GRPO) is computationally expensive. To address this, we propose a predictive framework that models training dynamics and helps optimize resource usage. Through experiments on Llama and Qwen models (3B/8B), we derive an empirical scaling law based on model size, initial performance, and training progress. This law predicts reward trajectories and identifies three consistent training phases: slow start, rapid improvement, and plateau. We find that training beyond roughly one epoch offers little gain, suggesting that earlier stopping can significantly reduce compute without sacrificing performance. Our approach generalizes across model types, providing a practical guide for efficient GRPO-based fine-tuning.
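As a rough illustration of the three-phase reward trajectory and the early-stopping rule the paper motivates, here is a minimal sketch. The logistic functional form, the parameter values, and the gain threshold `epsilon` are assumptions chosen for illustration; the paper's actual fitted scaling law (in model size, initial performance, and training progress) is not reproduced here.

```python
import numpy as np

# Hypothetical sigmoid form for the reward trajectory R(t). A logistic curve
# naturally exhibits the three observed phases: slow start, rapid
# improvement, and plateau. This form is an assumption for illustration.
def reward_trajectory(t, r0, r_max, k, t_mid):
    """Logistic reward curve from initial reward r0 to asymptote r_max."""
    return r0 + (r_max - r0) / (1.0 + np.exp(-k * (t - t_mid)))

# Illustrative early-stopping rule: once past the peak of the "rapid
# improvement" phase, stop at the first step whose predicted marginal
# reward gain falls below a threshold epsilon (an illustrative choice).
def early_stop_step(params, horizon, epsilon=1e-3):
    t = np.arange(horizon)
    r = reward_trajectory(t, *params)
    gains = np.diff(r)                       # gain from step i to i+1
    peak = int(np.argmax(gains))             # end of rapid improvement
    below = np.where(gains[peak:] < epsilon)[0]
    return peak + int(below[0]) + 1 if below.size else horizon

# Illustrative parameters: r0, r_max, steepness k, midpoint t_mid.
params = (0.2, 0.8, 0.05, 100.0)
stop = early_stop_step(params, horizon=400)  # well before the horizon
```

In this toy setting, the predicted trajectory plateaus and the rule terminates training midway through the budget, mirroring the paper's finding that reward gain vanishes after roughly one epoch.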
Problem

Research questions and friction points this paper is trying to address.

Optimize resource usage in GRPO training of LLMs
Predict reward trajectories via empirical scaling laws
Identify optimal stopping points to reduce compute costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predictive framework models training dynamics
Empirical scaling law based on key factors
Identifies three consistent training phases
Datta Nimmaturi
Nutanix Inc.
Vaishnavi Bhargava
University of Wisconsin-Madison; Birla Institute of Technology & Science, Pilani, India
Rajat Ghosh
Nutanix Inc.
Johnu George
Nutanix Inc.
Debojyoti Dutta
Nutanix Inc.