rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

📅 2025-05-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current code reasoning research is hindered by scarce, high-difficulty, verifiable test cases. This paper introduces the first verifiable data synthesis paradigm tailored for competitive programming–level code reasoning, yielding a large-scale dataset comprising 418K problems and 580K long-chain reasoning solutions. Methodologically, it integrates oracle-driven problem synthesis, a three-stage input generation pipeline, and a bidirectional output mutual verification mechanism—enabling scalable, automated, triplet-level annotation (problem–solution–test case). Empirically, the dataset substantially enhances model generalization and verification capability: Qwen2.5-7B achieves 57.3% accuracy on LiveCodeBench—up from 17.4%—surpassing o3-mini; on USACO, its pass@1 reaches 16.15%, outperforming QWQ-32B.

📝 Abstract
Advancing code reasoning in large language models (LLMs) is fundamentally limited by the scarcity of high-difficulty datasets, especially those with verifiable input-output test cases necessary for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems and 580K long-reasoning solutions, along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming code problems and oracle solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of the rStar-Coder dataset, achieving leading performance comparable to frontier reasoning LLMs with much smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad (USACO), our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. Code and the dataset will be released at https://github.com/microsoft/rStar.
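The mutual verification idea for output labeling can be illustrated with a small sketch: run several independently derived candidate solutions on each synthesized input and keep an output label only when a clear majority agree. This is a hypothetical, voting-based simplification of the paper's mechanism (the function names, threshold, and callable-based "solutions" are illustrative assumptions, not the authors' implementation):

```python
from collections import Counter

def mutually_verify(solutions, test_inputs, min_agree=0.6):
    """Label each synthesized input with the majority output across
    candidate solutions; discard inputs where no output reaches the
    agreement threshold. All names here are illustrative."""
    labeled = []
    for x in test_inputs:
        outputs = [sol(x) for sol in solutions]
        (best, count), = Counter(outputs).most_common(1)
        if count / len(solutions) >= min_agree:
            labeled.append((x, best))
    return labeled

# Toy example: three candidate "solutions" to the same problem,
# one of which is buggy on odd inputs. Majority voting still
# recovers the correct label for every input.
sols = [
    lambda n: n * 2,
    lambda n: n + n,
    lambda n: n * 2 if n % 2 == 0 else n,  # buggy for odd n
]
print(mutually_verify(sols, [2, 3, 4]))  # → [(2, 4), (3, 6), (4, 8)]
```

The intuition is that independent solutions rarely share the same bug, so agreement among a majority is strong evidence that the output is the true label even when no reference oracle exists for the new input.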
Problem

Research questions and friction points this paper is trying to address.

Scarcity of high-difficulty verified datasets for LLM code reasoning
Need for large-scale competition-level code problems with test cases
Improving code reasoning in smaller LLMs to match frontier models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs large-scale verified competition-level code dataset
Introduces reliable input-output test case synthesis pipeline
Augments problems with test-case-verified long-reasoning solutions
Authors
Yifei Liu, Microsoft Research Asia
Li Lyna Zhang, Microsoft Research Asia
Yi Zhu, Microsoft Research Asia
Bingcheng Dong, Microsoft Research Asia, Dalian University of Technology
Xudong Zhou, Microsoft Research Asia, Shanghai Jiao Tong University
Ning Shang, Microsoft Research Asia
Fan Yang, Microsoft Research Asia
Mao Yang, Microsoft Research Asia

Topics: Artificial Intelligence, Deep Learning, Reinforcement Learning, Long-Context