🤖 AI Summary
This study investigates why data scale and quality requirements differ between pretraining and the post-training phases of supervised fine-tuning (SFT) and reinforcement learning (RL), and redefines what constitutes high-quality SFT data. By formulating an in-context weight prediction task based on linear regression, the authors theoretically analyze the learning dynamics of transformers and validate their findings through experiments with large nonlinear transformer models. The work provides the first theoretical evidence that balanced pretraining data effectively unlocks a model's latent capabilities, that SFT achieves optimal performance with small-scale, high-difficulty data, and that RL relies on large-scale, medium-difficulty data. These insights offer a novel theoretical foundation for designing effective post-training data strategies.
📝 Abstract
Large Language Models (LLMs) are pretrained on massive datasets and later instruction-tuned via supervised fine-tuning (SFT) or reinforcement learning (RL). Best practices emphasize large, diverse pretraining data, whereas post-training operates differently: SFT relies on smaller, high-quality datasets, while RL benefits more from scale, with larger amounts of feedback often outweighing label quality. Yet it remains unclear why pretraining and RL require large datasets, why SFT excels on smaller ones, and what defines high-quality SFT data. In this work, we theoretically analyze transformers trained on an in-context weight prediction task for linear regression. Our analysis reveals several key findings: $(i)$ balanced pretraining data can induce latent capabilities later activated during post-training, and $(ii)$ SFT learns best from a small set of examples challenging for the pretrained model, while excessively large SFT datasets may dilute informative pretraining signals. In contrast, RL is most effective on large-scale data that is not overly difficult for the pretrained model. We validate these theoretical insights with experiments on large nonlinear transformer architectures.
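The in-context weight prediction task the abstract describes can be sketched in a few lines: a hidden weight vector defines a linear map, the model sees context pairs $(x_i, y_i)$, and the target is the weight vector itself. The dimensions, sampling distributions, and the least-squares baseline below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(d=8, n_context=16):
    """One in-context linear-regression task: a latent weight vector w,
    context pairs (x_i, y_i) with y_i = <w, x_i>; the prediction target is w.
    Hyperparameters here are illustrative, not taken from the paper."""
    w = rng.normal(size=d)               # latent weights to be predicted
    X = rng.normal(size=(n_context, d))  # context inputs
    y = X @ w                            # noiseless linear labels
    return X, y, w

X, y, w = sample_task()
# In the noiseless, well-determined case a least-squares solve recovers w
# exactly; this is the in-context behavior a trained transformer is
# analyzed as approximating.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_hat, w))
```

A transformer trained on many such sampled tasks must infer `w` from the context alone, which is what makes the task a clean testbed for studying how pretraining, SFT, and RL data properties shape what the model learns.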