UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance

📅 2025-02-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address unstable pretraining data quality, insufficient instruction diversity, and model bias in code generation, this paper proposes UnitCoder: a framework that leverages large language models (LLMs) to automatically generate unit tests as dual constraints—guiding synthetic program generation and verifying functional correctness—combined with package-level dependency retrieval to scale the synthesis of executable, API-diverse, high-quality Python programs. UnitCoder introduces the first test-driven iterative synthesis-and-verification paradigm, yielding a dataset of over 500K validated programs. Fine-tuning Llama3.1-8B and InternLM2.5-7B on this dataset achieves pass@1 improvements of +9% and +11%, respectively, on BigCodeBench—significantly outperforming baselines. These results empirically validate that test-guided synthesis substantially enhances code generation consistency, API diversity, and reliability.

Technology Category

Application Category

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale pre-training data and (ii) synthesizing instruction data through prompt engineering with powerful models. While pre-training data faces quality consistency issues, instruction-based synthesis suffers from limited instruction diversity and inherent biases of LLMs. To address this gap, we introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to both guide and validate the code generation process. Combined with large-scale package-based retrieval from pre-training corpus, we generate a dataset of 500K+ verifiable programs containing diverse API calls. Evaluations on multiple Python benchmarks (BigCodeBench, HumanEval, MBPP) demonstrate that models fine-tuned on our synthetic data exhibit consistent performance improvements. Notably, Llama3.1-8B and InternLM2.5-7B improve from 31% and 28% to 40% and 39% success rates on BigCodeBench, respectively. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora, demonstrating the potential for producing diverse and high-quality post-training data at scale. All code and data will be released (https://github.com).
Problem

Research questions and friction points this paper is trying to address.

Synthesizes verifiable code using unit tests
Improves code generation quality and diversity
Enhances performance on Python benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages model-generated unit tests
Combines package-based retrieval
Produces diverse verifiable programs
🔎 Similar Papers
No similar papers found.
Yichuan Ma
Yichuan Ma
Fudan University
LLMSynthetic Data
Yunfan Shao
Yunfan Shao
Fudan University
Natrual Language ProcessingMachine Learning
Peiji Li
Peiji Li
Fudan University
D
Demin Song
Shanghai AI Laboratory, Shanghai
Qipeng Guo
Qipeng Guo
Fudan University
L
Linyang Li
Shanghai AI Laboratory, Shanghai
X
Xipeng Qiu
School of Computer Science, Fudan University, Shanghai
K
Kai Chen
Shanghai AI Laboratory, Shanghai