Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

📅 2026-05-21
🤖 AI Summary
This work addresses the prevailing focus on loss-function design in large language model post-training, which overlooks how the distribution of training states (a state being a prompt concatenated with a generated prefix) affects both performance and knowledge retention. The authors propose a unified framework that treats supervised fine-tuning (SFT), on-policy reinforcement learning (RL), and on-policy distillation (OPD) as ways of shaping state distributions. Controlled experiments with Qwen3-0.6B-Base show that a mild SFT run improves GSM8K performance with little forgetting, whereas a stress SFT run causes substantial retention loss; that OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU; and that a lightweight on-policy RL run improves GSM8K while preserving retention. These findings suggest that the source and locality of training states can be as important as the form of the supervision signal in determining model capability and forgetting.
📝 Abstract
Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled small-scale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss. Second, OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU, despite using the teacher as its only supervision source. Third, a lightweight on-policy RL run improves GSM8K while preserving retention. These results support a state-centric view of post-training: the source and locality of training states can be as important as the form of the supervision signal.
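The distinction between fixed dataset states and learner-induced states can be illustrated with a small sketch. This is not the paper's code: the function names (`dataset_states`, `on_policy_states`), the token lists, and the deterministic `toy_policy` are all hypothetical stand-ins chosen only to show where each method's training states come from.

```python
from typing import Callable, List, Sequence, Tuple

# A state is (prompt, generated prefix), following the abstract's definition.
State = Tuple[str, Tuple[str, ...]]
Policy = Callable[[str, Tuple[str, ...]], str]


def dataset_states(prompt: str, reference: Sequence[str]) -> List[State]:
    """SFT-style states: every proper prefix of a fixed reference completion."""
    return [(prompt, tuple(reference[:t])) for t in range(len(reference))]


def on_policy_states(prompt: str, policy: Policy, max_len: int,
                     eos: str = "<eos>") -> List[State]:
    """RL/OPD-style states: prefixes visited by rolling out the current learner."""
    prefix: Tuple[str, ...] = ()
    states: List[State] = []
    for _ in range(max_len):
        states.append((prompt, prefix))   # supervision is applied at this state
        token = policy(prompt, prefix)    # the learner chooses the next token
        if token == eos:
            break
        prefix = prefix + (token,)
    return states


def toy_policy(prompt: str, prefix: Tuple[str, ...]) -> str:
    """Hypothetical learner that always writes '2 + 2 = 4' and then stops."""
    script = ("2", "+", "2", "=", "4", "<eos>")
    return script[len(prefix)] if len(prefix) < len(script) else "<eos>"


prompt = "What is 2+2?"
sft_states = dataset_states(prompt, ["The", "answer", "is", "4"])
learner_states = on_policy_states(prompt, toy_policy, max_len=8)

# States visited by both sources: here, only the empty prefix.
shared = set(sft_states) & set(learner_states)
```

In this toy setup the two sources overlap only at the empty prefix: SFT supervises prefixes of the reference answer, while the on-policy loop supervises prefixes the learner itself produced. An RL reward or a teacher's token distribution would then be evaluated at `learner_states` rather than at `sft_states`.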
Problem

Research questions and friction points this paper is trying to address.

post-training
state distribution
supervised fine-tuning
reinforcement learning
on-policy distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

state distribution
post-training
on-policy distillation
supervised fine-tuning
reinforcement learning