Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of training large language models (LLMs) from scratch to perform long chain-of-thought (long COT) reasoning, i.e., starting from base models that lack inherent complex deductive capabilities. To this end, we propose two-stage curriculum-style supervised fine-tuning (SFT) coupled with semi-on-policy Direct Preference Optimization (DPO), augmented by GRPO, a reinforcement learning method applied to extend reasoning depth and enhance performance. Our contributions include: (i) an end-to-end from-scratch training paradigm explicitly tailored for long COT; (ii) a high-quality, 3k-sample SFT dataset empirically validated for cross-model transferability; and (iii) the Light-R1-14B-DS model, which achieves 74.0 and 60.2 on AIME24 and AIME25, respectively, surpassing many 32B models as well as DeepSeek-R1-Distill-Llama-70B. This demonstrates substantial improvements in both mathematical reasoning generalization and inference efficiency.

📝 Abstract
This paper presents our work on the Light-R1 series, with models, data, and code all released. We first focus on training long COT models from scratch, specifically starting from models initially lacking long COT capabilities. Using a curriculum training recipe consisting of two-stage SFT and semi-on-policy DPO, we train our model Light-R1-32B from Qwen2.5-32B-Instruct, resulting in superior math performance compared to DeepSeek-R1-Distill-Qwen-32B. Despite being trained exclusively on math data, Light-R1-32B shows strong generalization across other domains. In the subsequent phase of this work, we highlight the significant benefit of the 3k-sample dataset constructed for the second SFT stage in enhancing other models. By fine-tuning DeepSeek-R1-Distilled models on this dataset, we obtain new SOTA models at 7B and 14B, while the 32B model, Light-R1-32B-DS, performs comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying reinforcement learning, specifically GRPO, to long COT models to further improve reasoning performance. We successfully train our final model, Light-R1-14B-DS, with RL, achieving SOTA math performance among 14B-parameter models. With AIME24 and AIME25 scores of 74.0 and 60.2 respectively, Light-R1-14B-DS surpasses even many 32B models and DeepSeek-R1-Distill-Llama-70B. Its RL training also exhibits the expected behavior, showing a simultaneous increase in response length and reward score. The Light-R1 series validates training long COT models from scratch, showcases the craft involved in SFT data curation, and releases SOTA models trained with RL.
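The semi-on-policy DPO stage mentioned above optimizes the standard DPO objective over preference pairs, with part of the data sampled from the policy being trained. As an illustrative sketch only (not the paper's implementation; function and argument names are our own), the per-pair loss compares policy and frozen-reference log-probabilities of the chosen and rejected responses:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    logp_w / logp_l: policy log-probabilities of the chosen (w) and
    rejected (l) responses; ref_logp_*: the same under the frozen
    reference model. beta controls deviation from the reference policy.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): low when the policy favors the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; raising the chosen response's likelihood relative to the reference drives the loss down.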
Problem

Research questions and friction points this paper is trying to address.

Training long-chain-of-thought models from scratch.
Enhancing model performance using curriculum SFT and DPO.
Improving reasoning with reinforcement learning on long-COT models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum training with two-stage SFT and semi-on-policy DPO
Reinforcement learning (GRPO) to further improve reasoning
Fine-tuning with a 3k-sample SFT dataset that transfers across models
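GRPO, the RL method used in the final stage, replaces PPO's learned value baseline with a group-relative one: several responses are sampled per prompt, and each response's reward is normalized by the group's mean and standard deviation. A minimal sketch of that advantage computation (illustrative; not the paper's training code):

```python
def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each reward in a
    group of sampled responses to the same prompt by the group's mean
    and standard deviation, in place of a learned value baseline."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

Responses scoring above the group mean receive positive advantages and are reinforced; a group with identical rewards yields zero advantages and no gradient signal.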
Liang Wen, Qiyuan Tech
Yunke Cai, Qiyuan Tech
Fenrui Xiao, Qiyuan Tech
Xin He, Qiyuan Tech
Qi An, Qiyuan Tech
Zhenyu Duan, Qiyuan Tech
Yimin Du, Qiyuan Tech
Junchen Liu, University of Texas Medical School, Houston, TX
Lifu Tang, Qiyuan Tech
Xiaowei Lv, Qiyuan Tech, Renmin University
Haosheng Zou, Tsinghua University
Yongchao Deng, Qiyuan Tech
Shousheng Jia, 360
Xiangzheng Zhang, 360