UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance

📅 2025-02-17

🤖 AI Summary
To address inconsistent pre-training data quality, limited instruction diversity, and LLM bias in code data synthesis, this paper proposes UnitCoder, an iterative pipeline that uses LLM-generated unit tests in two roles: guiding program synthesis and verifying functional correctness. Combined with package-based retrieval from a pre-training corpus, the pipeline scales the synthesis of executable, API-diverse Python programs and yields a dataset of more than 500K verified programs. Fine-tuning Llama3.1-8B and InternLM2.5-7B on this dataset raises BigCodeBench pass@1 from 31% to 40% and from 28% to 39%, gains of 9 and 11 percentage points respectively, with consistent improvements also reported on HumanEval and MBPP. These results support the claim that test-guided synthesis can produce diverse, high-quality post-training code data at scale.

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale pre-training data and (ii) synthesizing instruction data through prompt engineering with powerful models. While pre-training data faces quality consistency issues, instruction-based synthesis suffers from limited instruction diversity and inherent biases of LLMs. To address this gap, we introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to both guide and validate the code generation process. Combined with large-scale package-based retrieval from a pre-training corpus, we generate a dataset of 500K+ verifiable programs containing diverse API calls. Evaluations on multiple Python benchmarks (BigCodeBench, HumanEval, MBPP) demonstrate that models fine-tuned on our synthetic data exhibit consistent performance improvements. Notably, Llama3.1-8B and InternLM2.5-7B improve from 31% and 28% to 40% and 39% success rates on BigCodeBench, respectively. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora, demonstrating the potential for producing diverse and high-quality post-training data at scale. All code and data will be released (https://github.com).
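
The abstract does not spell out how package-based retrieval works. As a rough illustration only, one plausible reading is to keep corpus snippets that import at least one package of interest. The helper names `imported_packages` and `retrieve_by_package` below are hypothetical and are not taken from the paper; the sketch uses only Python's standard `ast` module.

```python
import ast


def imported_packages(source: str) -> set[str]:
    """Return the top-level package names imported by a Python snippet.

    Hypothetical helper for illustration; snippets that fail to parse
    yield an empty set instead of raising.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return set()
    packages = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports such as `from . import x`.
            if node.module and node.level == 0:
                packages.add(node.module.split(".")[0])
    return packages


def retrieve_by_package(corpus, target_packages):
    """Keep snippets that import at least one target package (assumed criterion)."""
    targets = set(target_packages)
    return [src for src in corpus if imported_packages(src) & targets]
```

For example, `retrieve_by_package(snippets, ["numpy", "pandas"])` would keep only the snippets in `snippets` that import `numpy` or `pandas`. The actual pipeline may use a different selection criterion.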
Problem

Research questions and friction points this paper is trying to address.

Synthesizing verifiable code data at scale, guided by unit tests
Improving the quality and diversity of code generation training data
Raising fine-tuned model performance on Python benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

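Leverages model-generated unit tests to both guide and validate code generation
Combines test guidance with package-based retrieval from a pre-training corpus
Produces 500K+ verifiable programs with diverse API calls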
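
The test-guided idea in the points above can be sketched as a small loop: run a candidate program against its unit tests, feed any failure back to a refinement step, and keep the program only once the tests pass. This is a minimal sketch under stated assumptions, not the paper's implementation. The names `run_tests` and `synthesize`, the `refine` callback (standing in for an LLM call), and the `max_rounds` limit are all hypothetical. A real pipeline would run generated code in a sandbox rather than with `exec`.

```python
import traceback
from typing import Callable, Optional


def run_tests(code: str, test_code: str) -> Optional[str]:
    """Run candidate code and its tests; return None on success, else the error text.

    Illustration only: exec() on untrusted code is unsafe outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)       # define the candidate function(s)
        exec(test_code, namespace)  # plain assertions raise on failure
    except Exception:
        return traceback.format_exc()
    return None


def synthesize(
    code: str,
    test_code: str,
    refine: Callable[[str, str], str],  # hypothetical stand-in for an LLM repair step
    max_rounds: int = 3,                # assumed limit, not from the paper
) -> Optional[str]:
    """Refine the code until the tests pass; return None if they never do."""
    error = run_tests(code, test_code)
    for _ in range(max_rounds):
        if error is None:
            break
        code = refine(code, error)
        error = run_tests(code, test_code)
    return code if error is None else None
```

Programs for which `synthesize` returns `None` would be discarded, so only test-verified programs enter the dataset. The same tests act as both the guidance signal (the error text passed to `refine`) and the acceptance check.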