🤖 AI Summary
Group Relative Policy Optimization (GRPO) for reinforcement learning–based fine-tuning of large language models (LLMs) incurs prohibitively high computational costs. Method: This work introduces the first predictive scaling-law framework tailored to GRPO training of large reasoning models, modeling training dynamics as a function of model size, initial performance, and training progress. It empirically identifies a universal three-phase evolution ("slow start," "rapid improvement," and "plateau") and fits reward trajectories across Llama and Qwen models (3B/8B) to validate cross-model generalizability. Contribution/Results: A key finding is that reward gains asymptotically vanish after one epoch, enabling early stopping without performance degradation. The framework provides quantifiable, generalizable termination criteria for efficient LLM reasoning fine-tuning, substantially reducing computational overhead while preserving reasoning capability.
📝 Abstract
Fine-tuning large language models (LLMs) for reasoning tasks with reinforcement learning methods such as Group Relative Policy Optimization (GRPO) is computationally expensive. To address this, we propose a predictive framework that models training dynamics and helps optimize resource usage. Through experiments on Llama and Qwen models (3B/8B), we derive an empirical scaling law based on model size, initial performance, and training progress. This law predicts reward trajectories and identifies three consistent training phases: slow start, rapid improvement, and plateau. We find that training beyond one epoch offers little additional gain, suggesting that earlier stopping can significantly reduce compute without sacrificing performance. Our approach generalizes across model types, providing a practical guide for efficient GRPO-based fine-tuning.
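The abstract does not state the paper's exact functional form, but the described three-phase trajectory (slow start, rapid improvement, plateau) is exactly the shape of a logistic curve. As a minimal sketch, assuming a logistic reward model with hypothetical parameters `r0` (initial reward), `r_max` (plateau reward), `k` (steepness), and `t0` (midpoint in epochs), one could fit observed reward trajectories and derive an early-stopping check like so:

```python
import numpy as np
from scipy.optimize import curve_fit

def reward_curve(t, r0, r_max, k, t0):
    """Hypothetical scaling-law form: logistic reward vs. training progress t
    (in epochs). Produces the three phases described in the paper:
    slow start (t << t0), rapid improvement (t ~ t0), plateau (t >> t0)."""
    return r0 + (r_max - r0) / (1.0 + np.exp(-k * (t - t0)))

# Synthetic reward trajectory for illustration only (not the paper's data).
t = np.linspace(0.0, 1.5, 60)
rng = np.random.default_rng(0)
obs = reward_curve(t, 0.2, 0.8, 12.0, 0.4) + rng.normal(0.0, 0.01, t.shape)

# Fit the curve to the observed trajectory.
params, _ = curve_fit(reward_curve, t, obs, p0=[0.1, 0.9, 5.0, 0.5])
r0_fit, r_max_fit, k_fit, t0_fit = params

# Early-stopping criterion: the predicted marginal reward gain per 0.01 epoch
# has effectively vanished by the one-epoch mark (the plateau phase).
gain_at_one_epoch = reward_curve(1.01, *params) - reward_curve(1.00, *params)
print(f"fitted plateau reward: {r_max_fit:.3f}")
print(f"marginal gain at 1 epoch: {gain_at_one_epoch:.2e}")
```

Under these assumed parameters, the fitted curve recovers the plateau level and shows a negligible marginal gain near one epoch, mirroring the paper's finding that training past one epoch adds little.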