🤖 AI Summary
Problem: High-quality open-source code instruction-tuning datasets are scarce, hindering the advancement of large language models (LLMs) in code generation, debugging, and reasoning.
Method: We introduce OpenCodeInstruct—the largest publicly available code instruction-tuning dataset to date (5 million samples)—in which each sample comprises a programming problem, a solution, test cases, execution feedback, and multi-dimensional LLM-based quality assessments. We propose a novel synthesis paradigm that integrates execution feedback with LLM evaluation, implemented as a scalable “seed selection → synthetic augmentation → multi-stage filtering” pipeline. Base models (e.g., LLaMA, Qwen) are then instruction-tuned on this data, with program execution verification, self-feedback distillation, and consistency-based filtering applied during dataset construction.
Contribution/Results: Across HumanEval, MBPP, LiveCodeBench, and BigCodeBench, SFT with OpenCodeInstruct improves average pass@1 by 12.6% across all model scales (1B+, 3B+, and 7B+), substantially advancing the capabilities of code-specialized LLMs.
📝 Abstract
Large Language Models (LLMs) have transformed software development by enabling code generation, automated debugging, and complex reasoning. However, their continued advancement is constrained by the scarcity of high-quality, publicly available supervised fine-tuning (SFT) datasets tailored for coding tasks. To bridge this gap, we introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments. We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset. Comprehensive evaluations on popular benchmarks (HumanEval, MBPP, LiveCodeBench, and BigCodeBench) demonstrate substantial performance improvements achieved by SFT with OpenCodeInstruct. We also present a detailed methodology encompassing seed data curation, synthetic instruction and solution generation, and filtering.
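The filtering stage described above pairs each synthetic solution with its generated test cases and keeps only samples whose solution actually executes correctly. A minimal sketch of that idea is below; the sample schema (`"solution"`, `"tests"` keys) and the in-process `exec` harness are illustrative assumptions, not the authors' implementation, which would typically sandbox execution.

```python
def passes_tests(solution_code: str, test_cases: list[str]) -> bool:
    """Run a candidate solution, then each test assertion, in a fresh namespace."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)   # define the generated solution
        for test in test_cases:
            exec(test, namespace)        # each test is an `assert` statement
    except Exception:
        return False                     # syntax error, runtime error, or failed assert
    return True

def execution_filter(samples: list[dict]) -> list[dict]:
    """Keep only samples whose solution passes all of its own test cases."""
    return [s for s in samples if passes_tests(s["solution"], s["tests"])]

# Hypothetical samples: one correct solution, one buggy one.
samples = [
    {"solution": "def add(a, b):\n    return a + b",
     "tests": ["assert add(2, 3) == 5"]},
    {"solution": "def add(a, b):\n    return a - b",   # buggy: subtraction
     "tests": ["assert add(2, 3) == 5"]},
]
kept = execution_filter(samples)   # only the correct sample survives
```

At dataset scale this step would run in isolated subprocesses with timeouts, since LLM-generated code cannot be trusted to terminate or behave safely.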