AI Summary
This study investigates the distinct mechanisms by which reinforcement learning (RL) and supervised fine-tuning (SFT) shape the reasoning capabilities of large language models (LLMs) in the maths and code domains. Using the same model and closely matched hyperparameters, we comparatively analyse the training dynamics of GRPO (a policy-gradient-based RL method) and SFT, employing layer-wise parameter update analysis to uncover the underlying mechanisms. Results show that RL primarily amplifies pre-existing reasoning abilities, with good knowledge retention but limited in-domain gains; in contrast, SFT yields substantial in-domain improvements yet degrades cross-domain capabilities through weight-overwriting updates. This work provides the first systematic, parameter-evolution-based explanation of the phenomenon "RL amplifies old capabilities, while SFT replaces old skills", offering both theoretical insight and empirical evidence for more controllable LLM reasoning training.
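The layer-wise parameter update analysis mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual code: the toy "checkpoints", the parameter names, and the `layer_update_norms` helper are all invented for the example.

```python
import numpy as np

def layer_update_norms(ckpt_before, ckpt_after):
    # Relative L2 change of each parameter tensor between two checkpoints.
    return {
        name: float(np.linalg.norm(ckpt_after[name] - ckpt_before[name])
                    / (np.linalg.norm(ckpt_before[name]) + 1e-12))
        for name in ckpt_before
    }

# Toy "checkpoints": tiny state dicts with attention and MLP weights.
rng = np.random.default_rng(0)
base = {name: rng.standard_normal((4, 4))
        for name in ("layers.0.attn.q_proj", "layers.0.attn.k_proj",
                     "layers.0.mlp.up_proj")}
# Simulated training run that perturbs only the attention projections.
tuned = {name: w + 0.1 * rng.standard_normal(w.shape) if "attn" in name
         else w.copy()
         for name, w in base.items()}

norms = layer_update_norms(base, tuned)
most_changed = max(norms, key=norms.get)  # an attention projection here
```

Ranking the per-tensor relative norms in this way is one simple way to surface findings like "query and key weights change the most".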
Abstract
Training large language models (LLMs) for reasoning via maths and code datasets has become a major new focus in LLM post-training. Two particularly popular approaches are reinforcement learning (RL) and supervised fine-tuning (SFT), but their training dynamics are poorly understood. We present a comparative analysis of RL and SFT on the same maths problems with the same model and similar hyperparameters. We find that RL yields minor in-domain gains on maths and slight degradation on knowledge-intensive benchmarks like MMLU; both trends are more pronounced with SFT. We also analyse model parameters across checkpoints, observing that both algorithms modify query and key weights the most. Meanwhile, SFT exhibits larger updates overall and also affects mid-layer MLPs more, leading us to hypothesise that this may have caused the out-of-domain degradation. We therefore investigate whether freezing parts of the model during training can mitigate the reduced performance on knowledge-intensive benchmarks. However, our results are inconclusive, with benefits on GPQA:Diamond and degradation on other benchmarks. Taken together, our observations provide a preliminary indication of why RL amplifies existing capabilities, while SFT replaces old skills with new ones.
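The freezing intervention described above amounts to excluding selected parameter groups (here, mid-layer MLPs) from training. A minimal sketch of the selection logic, assuming hypothetical Hugging-Face-style parameter names and an invented layer range; in a real run, each name mapped to `True` would get `requires_grad = False` in PyTorch before the optimiser is built.

```python
import re

def freeze_mask(param_names, frozen_patterns):
    # True means the parameter would be frozen (requires_grad=False).
    return {name: any(re.search(p, name) for p in frozen_patterns)
            for name in param_names}

names = [
    "model.layers.10.mlp.gate_proj.weight",     # mid-layer MLP -> frozen
    "model.layers.10.self_attn.q_proj.weight",  # attention -> trainable
    "model.layers.30.mlp.gate_proj.weight",     # late MLP -> trainable
]
# Hypothetical choice: freeze only the MLPs in layers 8-23.
mask = freeze_mask(names, [r"layers\.(?:[89]|1\d|2[0-3])\.mlp"])
```

Name-pattern selection like this keeps the intervention declarative, so the same training loop can be reused across freezing configurations.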