RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs

📅 2025-09-25
🤖 AI Summary
This work investigates how reinforcement learning (RL) and supervised fine-tuning (SFT) differentially shape the mathematical reasoning capabilities of large language models. We propose a novel analytical framework grounded in the topological structure of reasoning paths, enabling dual-level quantification—over trajectories and individual steps—via reasoning graphs. Key metrics include node visit frequency, degree, and, critically, the decay rate of betweenness centrality. Our findings reveal, for the first time: (1) RL dramatically prunes erroneous reasoning paths, concentrating functional reasoning into highly compact subgraphs—increasing the betweenness decay rate by ~2.5×; (2) SFT expands correct reasoning paths, yielding more uniform reasoning distributions—reducing the decay rate to ~1/3 of its pre-SFT value. These complementary mechanisms provide an interpretable, structural account of the widely adopted “SFT→RL” two-stage training paradigm, uncovering principled, topology-driven patterns underlying reasoning capability evolution.

📝 Abstract
Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely elusive. Going beyond an accuracy-based investigation of how these two components sculpt the reasoning process, this paper introduces a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process (with models of 1.5B, 7B, and 14B parameters on mathematical domains). Specifically, we investigate the reasoning process at two levels of granularity: the trajectory-level, which examines complete reasoning outputs, and the step-level, which analyzes reasoning graphs whose nodes correspond to individual reasoning steps. Notably, clustering of unique reasoning trajectories shows complementary effects: RL compresses incorrect trajectories, whereas SFT expands correct ones. Step-level analysis reveals that RL steepens (about 2.5 times), while SFT flattens (reduced to about one-third), the decay rates of node visitation frequency, degree, and betweenness centrality distributions in the reasoning graph. This indicates that RL concentrates reasoning functionality into a small subset of steps, while SFT homogenizes it across many steps. Furthermore, by evaluating the reasoning graph topologies from multiple perspectives, we delineate the shared and distinct characteristics of RL and SFT. Our work presents a novel reasoning path perspective that explains why the current best practice of two-stage training, with SFT followed by RL, is successful, and offers practical implications for data construction and more efficient learning approaches.
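The step-level metrics described above (node visit frequency, degree, and the decay rate of betweenness centrality over a reasoning graph) can be sketched concretely. The snippet below is a minimal illustration, not the paper's actual pipeline: it assumes reasoning steps have already been mapped to discrete node identifiers, builds a directed graph from step transitions using `networkx`, and estimates a decay rate by fitting an exponential to the rank-ordered distribution. The function names and the exponential fit are illustrative assumptions.

```python
import networkx as nx
import numpy as np
from collections import Counter

def build_reasoning_graph(trajectories):
    """Build a directed reasoning graph: each trajectory is a list of
    step identifiers; consecutive steps become directed edges.
    Also tally per-node visit frequency."""
    g = nx.DiGraph()
    visits = Counter()
    for traj in trajectories:
        for step in traj:
            visits[step] += 1
        for a, b in zip(traj, traj[1:]):
            g.add_edge(a, b)
    return g, visits

def decay_rate(values):
    """Fit log(v_k) ~ log(v_1) - lam * k over the rank-ordered values
    and return lam. A steeper (larger) lam means the metric is
    concentrated in a few nodes; a flatter lam means it is spread out."""
    vals = np.sort(np.asarray(values, dtype=float))[::-1]
    vals = vals[vals > 0]  # drop zeros before taking logs
    ranks = np.arange(1, len(vals) + 1)
    lam, _ = np.polyfit(ranks, np.log(vals), 1)
    return -lam

# Toy trajectories (step identifiers are made up for illustration)
trajs = [["parse", "setup", "solve", "check"],
         ["parse", "setup", "expand", "solve", "check"],
         ["parse", "guess", "check"]]
g, visits = build_reasoning_graph(trajs)
bc = nx.betweenness_centrality(g)
print(round(decay_rate(list(visits.values())), 3))
```

Under this framing, the paper's headline result reads as: after RL the fitted decay rate roughly 2.5x steepens (reasoning concentrates into a small subgraph), while after SFT it drops to roughly one-third (reasoning spreads more uniformly across steps).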
Problem

Research questions and friction points this paper is trying to address.

Investigates how RL and SFT training shape LLM reasoning processes differently
Analyzes reasoning path changes at trajectory and step levels using mathematical domains
Explains why the widely used two-stage practice of SFT followed by RL is successful
Innovation

Methods, ideas, or system contributions that make the work stand out.

RL compresses incorrect reasoning trajectories, concentrating reasoning into compact subgraphs
SFT expands correct reasoning paths, distributing reasoning more uniformly across steps
Novel reasoning-graph framework quantifies these effects via node visit frequency, degree, and betweenness-centrality decay rates
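The trajectory-level finding above (RL shrinks the set of distinct incorrect trajectories, SFT grows the set of distinct correct ones) can be illustrated with a toy comparison. This is a hypothetical sketch: it treats each unique trajectory signature as its own cluster, which stands in for the paper's actual clustering of reasoning trajectories.

```python
def cluster_counts(trajectories, labels):
    """trajectories: list of hashable trajectory signatures;
    labels: parallel booleans (True = reached the correct answer).
    Returns (#unique correct clusters, #unique incorrect clusters)."""
    correct = {t for t, ok in zip(trajectories, labels) if ok}
    incorrect = {t for t, ok in zip(trajectories, labels) if not ok}
    return len(correct), len(incorrect)

# Made-up trajectory signatures for illustration only.
base = (["A>B>C", "A>X", "A>Y", "A>B>D"], [True, False, False, True])
after_rl = (["A>B>C", "A>B>C", "A>X", "A>B>D"], [True, True, False, True])

print(cluster_counts(*base))      # baseline trajectory diversity
print(cluster_counts(*after_rl))  # fewer distinct incorrect trajectories
```

In the toy example the baseline has two distinct incorrect trajectories while the post-RL sample has one, mirroring (in miniature) the compression effect reported for RL; an SFT analogue would instead increase the count of distinct correct trajectories.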