🤖 AI Summary
This work addresses the inefficiency of conventional data scaling strategies in supervised fine-tuning with long chain-of-thought reasoning. The authors propose an alternative approach that, under a fixed training budget, replaces large-scale single-epoch training with repeated training over a small dataset across many epochs. Their method integrates supervised fine-tuning, chain-of-thought data, multi-epoch training, and token-level accuracy monitoring, using training token accuracy as a stopping criterion. Experiments demonstrate that Olmo3-7B trained on only 400 samples for 128 epochs outperforms a model trained on 51,200 samples in a single epoch by 12–26 percentage points on the AIME'24/25 and GPQA benchmarks, without exhibiting catastrophic forgetting. These results validate that repeated training surpasses data scaling in enhancing both memorization and generalization capabilities.
📝 Abstract
Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training on more unique samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On the AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent single epoch on 51,200 samples by 12–26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated: improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings yield a practical recipe for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive, undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
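The abstract's stopping rule, repeat epochs over a small dataset until training token accuracy saturates, can be sketched as a simple monitor around the epoch loop. The `threshold` and `patience` parameters below are illustrative assumptions, not values from the paper, which only states that gains plateau at full memorization:

```python
def should_stop(epoch_token_accuracies, threshold=0.99, patience=2):
    """Signal saturation once training token accuracy has stayed at or
    above `threshold` for `patience` consecutive epochs, i.e. the model
    has effectively memorized the small SFT dataset."""
    if len(epoch_token_accuracies) < patience:
        return False
    return all(acc >= threshold for acc in epoch_token_accuracies[-patience:])


def train_until_memorized(run_epoch, max_epochs=128):
    """Toy driver: repeat epochs over a fixed small dataset, stopping
    when the token-accuracy criterion fires. `run_epoch` is a stand-in
    for one SFT pass that returns that epoch's training token accuracy."""
    history = []
    for _ in range(max_epochs):
        history.append(run_epoch())
        if should_stop(history):
            break
    return history
```

In a real training loop, `run_epoch` would perform one pass of SFT and report the fraction of target tokens predicted correctly; the monitor itself is framework-agnostic.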