Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of training large language models (LLMs) from scratch to perform long chain-of-thought (long COT) reasoning, i.e., starting from base models that lack inherent complex deductive capabilities. To this end, we propose two-stage curriculum-style supervised fine-tuning (SFT) coupled with semi-on-policy Direct Preference Optimization (DPO), augmented by GRPO, a reinforcement learning method applied to extend reasoning depth and enhance performance. Our contributions include: (i) an end-to-end from-scratch training paradigm explicitly tailored for long COT; (ii) a high-quality, 3k-sample SFT dataset empirically validated for cross-model transferability; and (iii) the Light-R1-14B-DS model, which achieves 74.0 and 60.2 on AIME24 and AIME25, respectively, surpassing many 32B models as well as DeepSeek-R1-Distill-Llama-70B. This demonstrates substantial improvements in both mathematical reasoning generalization and inference efficiency.

📝 Abstract
This paper presents our work on the Light-R1 series, with models, data, and code all released. We first focus on training long COT models from scratch, specifically starting from models initially lacking long COT capabilities. Using a curriculum training recipe consisting of two-stage SFT and semi-on-policy DPO, we train our model Light-R1-32B from Qwen2.5-32B-Instruct, resulting in superior math performance compared to DeepSeek-R1-Distill-Qwen-32B. Despite being trained exclusively on math data, Light-R1-32B shows strong generalization across other domains. In the subsequent phase of this work, we highlight the significant benefit of the 3k-sample dataset constructed for the second SFT stage in enhancing other models. By fine-tuning DeepSeek-R1-Distilled models on this dataset, we obtain new SOTA models at 7B and 14B, while the 32B model, Light-R1-32B-DS, performs comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying reinforcement learning, specifically GRPO, to long COT models to further improve reasoning performance. We successfully train our final model, Light-R1-14B-DS, with RL, achieving SOTA math performance among 14B-parameter models. With AIME24 and AIME25 scores of 74.0 and 60.2 respectively, Light-R1-14B-DS surpasses even many 32B models and DeepSeek-R1-Distill-Llama-70B. Its RL training also exhibits the expected behavior, showing a simultaneous increase in response length and reward score. The Light-R1 series validates training long COT models from scratch, showcases the craft involved in SFT data curation, and releases SOTA models trained with RL.
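The semi-on-policy DPO stage mentioned above optimizes the standard DPO objective over preference pairs, with part of the data sampled from the policy being trained. As an illustrative sketch only (not the paper's implementation; function and argument names are our own), the per-pair loss compares policy and frozen-reference log-probabilities of the chosen and rejected responses:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    logp_w / logp_l: policy log-probabilities of the chosen (w) and
    rejected (l) responses; ref_logp_*: the same under the frozen
    reference model. beta controls deviation from the reference policy.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): low when the policy favors the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; raising the chosen response's likelihood relative to the reference drives the loss down.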
Problem

Research questions and friction points this paper is trying to address.

Training long-chain-of-thought models from scratch.
Enhancing model performance using curriculum SFT and DPO.
Improving reasoning with reinforcement learning on long-COT models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum training with two-stage SFT and semi-on-policy DPO
Reinforcement learning (GRPO) to further improve reasoning
Fine-tuning with a 3k-sample SFT dataset that transfers across models
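GRPO, the RL method used in the final stage, replaces PPO's learned value baseline with a group-relative one: several responses are sampled per prompt, and each response's reward is normalized by the group's mean and standard deviation. A minimal sketch of that advantage computation (illustrative; not the paper's training code):

```python
def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each reward in a
    group of sampled responses to the same prompt by the group's mean
    and standard deviation, in place of a learned value baseline."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

Responses scoring above the group mean receive positive advantages and are reinforced; a group with identical rewards yields zero advantages and no gradient signal.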
Liang Wen, Qiyuan Tech
Yunke Cai, Qiyuan Tech
Fenrui Xiao, Qiyuan Tech
Xin He, Qiyuan Tech
Qi An, Qiyuan Tech
Zhenyu Duan, Qiyuan Tech
Yimin Du, Qiyuan Tech
Junchen Liu, University of Texas Medical School, Houston, TX
Lifu Tang, Qiyuan Tech
Xiaowei Lv, Qiyuan Tech, Renmin University
Haosheng Zou, Tsinghua University
Yongchao Deng, Qiyuan Tech
Shousheng Jia, 360
Xiangzheng Zhang, 360