100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models

📅 2025-05-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
The insufficient open-sourcing and poor reproducibility of reasoning language models (RLMs) such as DeepSeek-R1 hinder community progress. Method: This survey systematically reviews more than 20 major open-source RLM replication efforts released within 100 days of DeepSeek-R1, focusing on the two dominant paradigms: supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR). Through comparative analysis of data construction, reward design, and training practices, it identifies points of technical consensus and discrepancy across replications. Contribution/Results: The survey distills recurring recipes, including chain-of-thought data synthesis, verifiable reward modeling, and multi-stage distillation, and the reported results consistently indicate that SFT data quality and RLVR reward verifiability are decisive factors for reasoning performance. The work consolidates a reproducible, scalable RLM development methodology, providing both conceptual grounding and engineering guidelines for the full-stack open-source RLM ecosystem.

📝 Abstract
The recent development of reasoning language models (RLMs) represents a novel evolution in large language models. In particular, the recent release of DeepSeek-R1 has generated widespread social impact and sparked enthusiasm in the research community for exploring the explicit reasoning paradigm of language models. However, the implementation details of the released models, including DeepSeek-R1-Zero, DeepSeek-R1, and the distilled small models, have not been fully open-sourced by DeepSeek. As a result, many replication studies have emerged that aim to reproduce the strong performance of DeepSeek-R1, reaching comparable results through similar training procedures and fully open-source data resources. These works have investigated feasible strategies for supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR), focusing on data preparation and method design, and have yielded various valuable insights. In this report, we summarize recent replication studies to inspire future research. We primarily focus on SFT and RLVR as the two main directions, detailing the data construction, method design, and training procedures of current replication studies. Moreover, we distill key findings from the implementation details and experimental results these studies report. We also discuss additional techniques for enhancing RLMs, highlighting the potential of expanding their application scope and the challenges in their development. Through this survey, we aim to help researchers and developers of RLMs stay up to date with the latest advancements and to inspire new ideas that further enhance RLMs.
Problem

Research questions and friction points this paper is trying to address.

Replicating DeepSeek-R1's performance with open-source data
Exploring SFT and RLVR methods for reasoning language models
Summarizing key findings to inspire future RLM research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic survey of 20+ open-source DeepSeek-R1 replication efforts
Comparative analysis of SFT and RLVR data construction, reward design, and training
Fully open-source data and training procedures examined for reproducibility
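The RLVR paradigm discussed throughout hinges on one idea: the reward is computed by programmatically checking the model's final answer against a reference, rather than by a learned reward model. A minimal sketch of such a rule-based verifier is below; the function name, `\boxed{}` answer format, and exact-match criterion are illustrative assumptions, not the recipe of any specific replication study.

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the boxed final answer matches the reference, else 0.0."""
    # Look for a LaTeX-style \boxed{...} answer at the end of the reasoning trace.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # unparseable output earns no reward
    predicted = match.group(1).strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

# Example: a math completion ending in a boxed answer
completion = r"Adding the terms gives \boxed{42}."
print(verifiable_reward(completion, "42"))  # → 1.0
```

In practice, replication efforts vary this check (normalizing equivalent math expressions, running unit tests for code answers), but the binary, automatically verifiable signal is what distinguishes RLVR from preference-based reward modeling.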