OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
The scarcity of large-scale, high-quality datasets for code generation and critique hinders the development of large language models' code comprehension and self-correction capabilities. Method: We introduce OpenCodeReasoning-II, a dataset comprising 2.5 million question–solution–critique triples, and propose a two-stage supervised fine-tuning strategy: (1) fine-tuning for code generation; and (2) joint training on both code generation and critique. The critique model then enables test-time self-correction via iterative refinement. We also extend the LiveCodeBench benchmark to support C++. Results: The fine-tuned Qwen2.5-Instruct models achieve performance on standard code generation benchmarks that matches or exceeds the best prior open-weight distilled models. Integrating the code generation and critique models yields substantial gains on competitive programming tasks, empirically validating generation–critique co-modeling for robust code synthesis and self-improvement.

📝 Abstract
Recent advancements in reasoning-based Large Language Models (LLMs), particularly their potential through test-time scaling, have created significant opportunities for distillation in code generation and critique. However, progress in both areas fundamentally depends on large-scale, high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique programming questions), making it nearly twice the size of the previous largest publicly available code reasoning dataset. We employ a two-stage supervised fine-tuning strategy: the first stage focuses on fine-tuning for code generation, while the second stage involves the joint training of models for both code generation and critique. Our resulting fine-tuned Qwen2.5-Instruct models achieve performance in code generation that either exceeds or equals the best prior open-weight distilled models. Notably, the integration of our code generation and critique models leads to significant improvements in competitive coding performance. Furthermore, we present an extension of the LiveCodeBench benchmark to specifically support the C++ programming language, thereby facilitating more comprehensive LLM evaluation using this benchmark.
Problem

Research questions and friction points this paper is trying to address.

Scarcity of large-scale, high-quality datasets for code reasoning and critique
Need for models that can both generate code and critique it for self-correction
Limited benchmark support for C++ in LLM code evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage supervised fine-tuning strategy
Joint training for code generation and critique
Extension of LiveCodeBench for C++ evaluation