Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing code synthesis models rely predominantly on single-pass autoregressive generation, diverging from developers' iterative editing practice, largely because high-quality, semantically consistent program edit sequence data is scarce. Method: The paper proposes LintSeq, a linter-guided algorithm for generating synthetic edit sequences. It refactors instruction + program pairs into instruction + program-diff-sequence tuples, using static analysis so that sampled edits respect the syntax and semantics of the programming language. LMs ranging from 2.6B to 14B parameters are fine-tuned on both the refactored and original datasets, and additional tiny code LMs are pretrained from scratch and fine-tuned to synthesize code edit by edit. Results: On HumanEval, MBPP(+), CodeContests, DS-1000, and BigCodeBench, edit sequence models match or exceed baseline pass@1 and scale better at higher pass@k as a function of total test-time FLOPs; the tiny edit sequence models are competitive with similarly sized code LMs such as CodeT5+, AlphaCode, and Codex.

📝 Abstract
Software engineers mainly write code by editing existing programs. In contrast, language models (LMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of sequential edit data. While high-quality instruction data for code synthesis is scarce, edit data for synthesis is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors programs into sequences of synthetic edits by using a linter to procedurally sample across interdependent lines of source code. Synthetic edits sampled with LintSeq reflect the syntax and semantics of their programming language. To test the algorithm, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we fine-tune a series of smaller LMs ranging from 2.6B to 14B parameters on both the refactored and original versions of this dataset. We perform comprehensive evaluations comparing edit sequence code LMs against baselines on HumanEval, MBPP(+), CodeContests, DS-1000, and BigCodeBench. We show that models fine-tuned to iteratively synthesize code match or outperform baselines on pass@1, and exhibit better scaling across higher pass@k as a function of total test-time FLOPs. Finally, we also pretrain our own tiny LMs for code understanding. We show that fine-tuning these models to synthesize code edit-by-edit results in strong performance on HumanEval and MBPP(+) compared to existing code language models of similar scale such as CodeT5+, AlphaCode, and Codex.
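The linter-guided sampling described in the abstract can be sketched as a toy backward-deletion procedure. This is an illustrative approximation, not the authors' implementation: `compile()` stands in for a real linter (it checks syntax only, not undefined names), the deletion policy here is greedy rather than random, and the function names are hypothetical.

```python
import difflib

def lints_clean(src: str) -> bool:
    # Toy stand-in for a real linter: accept any program that parses.
    # (LintSeq as described uses an actual linter; compile() only
    # catches syntax errors, not semantic issues like undefined names.)
    try:
        compile(src, "<sample>", "exec")
        return True
    except SyntaxError:
        return False

def sample_edit_sequence(program: str) -> list[str]:
    """Backward pass: repeatedly delete one line whose removal keeps the
    program lint-clean, recording each intermediate state. Reversing the
    states yields snapshots that 'grow' the program; consecutive unified
    diffs form a synthetic edit sequence for the original program."""
    states = [program.splitlines()]
    lines = list(states[0])
    while lines:
        # Greedily try deleting lines from the bottom up.
        for i in reversed(range(len(lines))):
            candidate = lines[:i] + lines[i + 1:]
            if lints_clean("\n".join(candidate)):
                lines = candidate
                break
        else:
            break  # no single-line deletion keeps the program valid
        states.append(list(lines))
    states.reverse()  # order from smallest program to full program
    return [
        "\n".join(difflib.unified_diff(a, b, lineterm=""))
        for a, b in zip(states, states[1:])
    ]

diffs = sample_edit_sequence("x = 1\ny = 2\nprint(x + y)")
for d in diffs:
    print(d, end="\n\n")
```

Pairing the resulting diff sequence with the program's original instruction gives one instruction + program-diff-sequence tuple of the kind the paper fine-tunes on.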
Problem

Research questions and friction points this paper is trying to address.

Scarcity of high-quality sequential edit data for code
LMs synthesize programs in a single pass, unlike developers who edit iteratively
Whether training on edit sequences improves code synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

LintSeq: linter-guided generation of synthetic edit sequences from existing programs
Fine-tuning 2.6B-14B LMs on instruction + program-diff-sequence data
Iterative, edit-by-edit code synthesis with better pass@k scaling per test-time FLOP