Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the reliance of large language models (LLMs) on complex reward mechanisms, such as reinforcement learning with verifiable rewards (RLVR), for improving mathematical reasoning. We propose Online Supervised Fine-Tuning (OSFT), a reward-free method in which the model samples a single self-generated reasoning trajectory and is immediately supervised fine-tuned on it, updating its parameters online. OSFT is, to the authors' knowledge, the first approach to empirically validate the effectiveness and robustness of reward-free, single-trajectory self-fine-tuning for mathematical reasoning, suggesting that latent knowledge acquired during pretraining can be elicited through autonomous self-refinement. On benchmarks including GSM8K, OSFT matches the performance of strong reward-driven methods such as GRPO while substantially reducing training overhead. The implementation is publicly available.

📝 Abstract
We present a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning. In this paradigm, the model generates its own responses and is immediately finetuned on this self-generated data. OSFT is a highly efficient training strategy for LLM reasoning, as it is reward-free and uses just one rollout by default. Experimental results show that OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO. Our ablation study further demonstrates the efficiency and robustness of OSFT. The major mechanism of OSFT lies in reinforcing the model's own existing preferences (latent knowledge) learned from pretraining, which leads to improved reasoning ability. We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms. Our code is available at https://github.com/ElementQi/OnlineSFT.
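
To make the paradigm concrete, here is a minimal sketch of one OSFT iteration, assuming a Hugging Face transformers / PyTorch setup; the model name, prompt, and hyperparameters below are illustrative placeholders rather than the authors' configuration (see the linked repository for the actual implementation).

```python
# Minimal OSFT sketch: sample one rollout, then immediately fine-tune on it.
# Assumptions: Hugging Face transformers + PyTorch; the model name, prompt,
# and hyperparameters are illustrative, not the paper's exact setup.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-1.5B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
optimizer = AdamW(model.parameters(), lr=1e-6)
model.train()

prompts = ["Natalia sold clips to 48 of her friends in April, ..."]  # math questions

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs.input_ids.shape[1]

    # 1) Single rollout (reward-free): sample one response from the current model.
    rollout = model.generate(
        **inputs, do_sample=True, temperature=1.0, max_new_tokens=512
    )

    # 2) Immediate SFT step: standard next-token cross-entropy on the
    #    self-generated tokens only; prompt positions are masked from the loss.
    labels = rollout.clone()
    labels[:, :prompt_len] = -100  # -100 is ignored by the loss
    loss = model(input_ids=rollout, labels=labels).loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because each update is plain cross-entropy on a single sampled trajectory, one OSFT step costs roughly one rollout plus one backward pass, with no reward model, verifier, or multi-sample advantage estimation; this is where the efficiency gain over reward-based methods like GRPO comes from.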
Problem

Research questions and friction points this paper is trying to address.

Can LLM reasoning be improved without complex reward mechanisms such as RLVR?
Can the latent knowledge a model acquires during pretraining be elicited to enhance reasoning?
Can a reward-free method match the performance of reward-based reinforcement learning?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online supervised finetuning without reward signals
Self-generated data for immediate model finetuning
Leverages the model's latent pretraining knowledge to improve reasoning
👥 Authors

Mengqi Li
The Chinese University of Hong Kong, Shenzhen

Lei Zhao
Shanghai Jiao Tong University

Anthony Man-Cho So
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
Research areas: Mathematical Optimization, Design and Analysis of Algorithms, Signal Processing

Ruoyu Sun
The Chinese University of Hong Kong, Shenzhen

Xiao Li
The Chinese University of Hong Kong, Shenzhen