🤖 AI Summary
Problem: High-quality open-source code instruction-tuning datasets are scarce, hindering the advancement of large language models (LLMs) in code generation, debugging, and reasoning.
Method: We introduce OpenCodeInstruct—the largest publicly available code instruction-tuning dataset to date (5 million samples)—in which each sample comprises a programming problem, a solution, test cases, execution feedback, and multi-dimensional LLM-based quality assessments. We propose a novel synthesis paradigm that integrates execution feedback with LLM evaluation, implemented as a scalable “seed selection → synthetic augmentation → multi-stage filtering” pipeline. Base models (e.g., LLaMA, Qwen) are then instruction-tuned on this data, with program execution verification, self-feedback distillation, and consistency-based filtering applied during dataset construction.
Contribution/Results: Across HumanEval, MBPP, LiveCodeBench, and BigCodeBench, SFT with OpenCodeInstruct improves average pass@1 by 12.6% across all model scales (1B+, 3B+, and 7B+), substantially advancing the capabilities of code-specialized LLMs.
📝 Abstract
Large Language Models (LLMs) have transformed software development by enabling code generation, automated debugging, and complex reasoning. However, their continued advancement is constrained by the scarcity of high-quality, publicly available supervised fine-tuning (SFT) datasets tailored for coding tasks. To bridge this gap, we introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments. We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset. Comprehensive evaluations on popular benchmarks (HumanEval, MBPP, LiveCodeBench, and BigCodeBench) demonstrate substantial performance improvements achieved by SFT with OpenCodeInstruct. We also present a detailed methodology encompassing seed data curation, synthetic instruction and solution generation, and filtering.
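The filtering stage described above pairs each synthetic solution with its generated test cases and keeps only samples whose solution actually executes correctly. A minimal sketch of that idea is below; the sample schema (`"solution"`, `"tests"` keys) and the in-process `exec` harness are illustrative assumptions, not the authors' implementation, which would typically sandbox execution.

```python
def passes_tests(solution_code: str, test_cases: list[str]) -> bool:
    """Run a candidate solution, then each test assertion, in a fresh namespace."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)   # define the generated solution
        for test in test_cases:
            exec(test, namespace)        # each test is an `assert` statement
    except Exception:
        return False                     # syntax error, runtime error, or failed assert
    return True

def execution_filter(samples: list[dict]) -> list[dict]:
    """Keep only samples whose solution passes all of its own test cases."""
    return [s for s in samples if passes_tests(s["solution"], s["tests"])]

# Hypothetical samples: one correct solution, one buggy one.
samples = [
    {"solution": "def add(a, b):\n    return a + b",
     "tests": ["assert add(2, 3) == 5"]},
    {"solution": "def add(a, b):\n    return a - b",   # buggy: subtraction
     "tests": ["assert add(2, 3) == 5"]},
]
kept = execution_filter(samples)   # only the correct sample survives
```

At dataset scale this step would run in isolated subprocesses with timeouts, since LLM-generated code cannot be trusted to terminate or behave safely.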