Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work argues that analyses of large language model post-training concentrate on loss-function design and overlook how the distribution of training states, defined as prompts concatenated with generated prefixes, affects both task performance and knowledge retention. The authors propose a unified framework that treats supervised fine-tuning (SFT), on-policy reinforcement learning (RL), and on-policy distillation (OPD) as ways of shaping the state distribution. Controlled small-scale experiments with Qwen3-0.6B-Base show three results: a mild SFT run improves GSM8K with little forgetting, whereas a stress SFT run causes substantial retention loss; OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU; and a lightweight on-policy RL run improves GSM8K while preserving retention. These findings suggest that the source and locality of training states can matter as much as the form of the supervision signal.

📝 Abstract

Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled small-scale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss. Second, OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU, despite using the teacher as its only supervision source. Third, a lightweight on-policy RL run improves GSM8K while preserving retention. These results support a state-centric view of post-training: the source and locality of training states can be as important as the form of the supervision signal.
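The distinction between dataset states and learner-induced states can be made concrete with a toy sketch. The code below is an illustration only, not the paper's implementation: the function names (`prefix_states`, `dataset_states`, `on_policy_states`), the `sample_next` callback, the `<eos>` marker, and the word-level "tokens" are all assumptions introduced here. It enumerates the states (prompt plus prefix) on which supervision would be applied in each regime.

```python
from typing import Callable, List, Sequence, Tuple

# A state is (prompt, generated prefix). Tokens are plain strings in this toy.
State = Tuple[str, Tuple[str, ...]]


def prefix_states(prompt: str, tokens: Sequence[str]) -> List[State]:
    """All states visited while emitting `tokens` after `prompt`."""
    return [(prompt, tuple(tokens[:t])) for t in range(len(tokens))]


def dataset_states(dataset: Sequence[Tuple[str, Sequence[str]]]) -> List[State]:
    """SFT-style states: prefixes of fixed reference responses.

    The learner's own behavior never influences which states appear here.
    """
    states: List[State] = []
    for prompt, reference in dataset:
        states.extend(prefix_states(prompt, reference))
    return states


def on_policy_states(
    prompts: Sequence[str],
    sample_next: Callable[[str, Tuple[str, ...]], str],  # hypothetical learner
    max_len: int = 8,
    eos: str = "<eos>",
) -> List[State]:
    """RL/OPD-style states: prefixes produced by the current learner.

    `sample_next(prompt, prefix)` stands in for sampling one token from the
    learner; the visited states change whenever the learner changes.
    """
    states: List[State] = []
    for prompt in prompts:
        prefix: List[str] = []
        for _ in range(max_len):
            states.append((prompt, tuple(prefix)))
            token = sample_next(prompt, tuple(prefix))
            if token == eos:
                break
            prefix.append(token)
    return states


if __name__ == "__main__":
    # Toy learner that answers "6" and then stops (an incorrect answer).
    def toy_learner(prompt: str, prefix: Tuple[str, ...]) -> str:
        return "6" if not prefix else "<eos>"

    print(dataset_states([("2+3=", ["5", "<eos>"])]))
    print(on_policy_states(["2+3="], toy_learner))
```

In this toy, the dataset regime supervises the state after the reference token `5`, while the on-policy regime supervises the state after the learner's own token `6`. Any loss (likelihood, reward, or a KL term against a teacher) would then be evaluated on whichever set of states the regime produces.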

Problem

Research questions and friction points this paper is trying to address.

post-training

state distribution

supervised fine-tuning

reinforcement learning

on-policy distillation

Innovation

Methods, ideas, or system contributions that make the work stand out.

state distribution

post-training

on-policy distillation