KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-quality, multi-difficulty, and verifiable code data for training large language models (LLMs) on coding remains scarce. Method: We propose a synthetic data construction framework built on question-solution-test triples. It features (1) a test-driven rejection-sampling and rewriting mechanism that leverages DeepSeek-R1 for difficulty adaptation and format diversity, and (2) an integrated pipeline combining multi-round solution generation, unit-test completeness verification, post-training question rewriting, and response distillation. Contribution/Results: The resulting dataset spans foundational syntax through advanced algorithmic tasks, and every sample ships with an executable solution and deterministic, executable tests. Evaluated on HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench, models fine-tuned on this data surpass strong baselines, including Qwen2.5-Coder-32B-Instruct, achieving state-of-the-art performance across all benchmarks.

📝 Abstract
We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing code-focused resources typically fail to ensure either breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases, allocating additional attempts to challenging problems. Finally, post-training data are synthesized by rewriting questions into diverse formats and generating responses from a reasoning model (DeepSeek-R1) under a test-based rejection-sampling procedure. This pipeline yields a large-scale, robust, and diverse coding dataset. KodCode is suitable for supervised fine-tuning, and the paired unit tests also make it well suited for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models such as Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
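The core of the pipeline is test-based rejection sampling: a candidate solution is kept only if it passes the question's generated unit tests. A minimal sketch of that loop is below; `generate` and `gen_tests` stand in for the paper's LLM calls and are assumptions for illustration, not the authors' actual API.

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(solution_code, test_code, timeout=10):
    """Run generated unit tests against a candidate solution in a subprocess.

    The tests pass iff the combined script exits with status 0."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "candidate.py")
        with open(path, "w") as f:
            f.write(solution_code + "\n\n" + test_code + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, path], capture_output=True, timeout=timeout
            )
            return proc.returncode == 0
        except subprocess.TimeoutExpired:
            return False

def reject_sample(question, generate, gen_tests, max_attempts=10):
    """Sample solutions until one passes the unit tests (self-verification).

    `generate` and `gen_tests` are placeholders for model calls. Returns a
    verified question-solution-test triplet, or None if the budget runs out
    (harder questions would get a larger `max_attempts`)."""
    tests = gen_tests(question)
    for _ in range(max_attempts):
        solution = generate(question)
        if passes_tests(solution, tests):
            return question, solution, tests  # verified triplet
    return None  # discard the question; no solution survived the tests
```

The subprocess isolation also mirrors the requirement that every kept sample has an executable, deterministic test: a triplet only enters the dataset once its tests have actually run and passed.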
Problem

Research questions and friction points this paper is trying to address.

Addresses lack of high-quality, verifiable coding training data
Ensures breadth of coverage and correctness via systematic validation
Enhances model performance on coding benchmarks through fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic dataset with question-solution-test triplets
Self-verification procedure ensures correctness
Post-training data synthesis enhances diversity