Diffusion is a code repair operator and generator

📅 2025-08-14

🤖 AI Summary

This paper addresses the “last-mile” problem in code repair: fixing nearly complete code snippets that are close to correct but contain minor syntactic or logical errors. We propose a dual-purpose framework based on pretrained code diffusion models. For repair, we inject noise into a broken snippet and resume denoising, treating the transitions between discrete decoded states late in the denoising process as fine-grained repair operations. For data generation, repair pairs are produced automatically from intermediate and final denoised samples, so the same model serves both code repair and synthetic data generation. Our key contribution is identifying a correspondence between diffusion trajectories and concrete repair operations, which lets one model act as both repair operator and data synthesizer. Evaluated across the Python, Excel, and PowerShell domains, our approach significantly improves repair accuracy and cross-domain generalization, while supporting multi-stage code representation learning.

📝 Abstract

Code diffusion models generate code by iteratively removing noise from the latent representation of a code snippet. During the later steps of the diffusion process, when the snippet has almost converged, the differences between successive discrete representations of the snippet resemble last-mile repairs applied to broken or incomplete code. We evaluate the extent to which this resemblance can be exploited to apply pre-trained code diffusion models to last-mile repair, considering two applications with significant potential. First, the diffusion model can perform last-mile repair directly: we add noise to a broken code snippet and resume the diffusion process. Second, the diffusion model can generate an arbitrary amount of training data for computationally more efficient last-mile repair approaches, by sampling an intermediate program (input) and the final program (output) from the diffusion process. We perform experiments on three domains (Python, Excel, and PowerShell) to evaluate both applications and to analyze their properties.
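The first application (add noise to a broken snippet, then resume denoising) can be illustrated with a toy sketch. This is not the paper's model: `CLEAN_SNIPPETS`, `encode`, `decode`, `denoise_step`, and `repair` are all hypothetical names, the "latent" is just one float per character, and the "denoiser" simply pulls the latent toward the nearest snippet it knows. It only shows the control flow of noising and resuming.

```python
import random

# Hypothetical stand-in for a pretrained code diffusion model: it "knows" a
# small set of clean snippets (all the same length, a toy simplification).
CLEAN_SNIPPETS = ["print(x)", "x = y+1 "]


def encode(snippet):
    """Map a snippet to a continuous latent: one float per character."""
    return [float(ord(c)) for c in snippet]


def decode(latent):
    """Round each latent value to the nearest printable ASCII character."""
    return "".join(chr(max(32, min(126, round(v)))) for v in latent)


def denoise_step(latent, strength=0.5):
    """One toy denoising step: move part of the way to the nearest clean latent."""
    targets = [encode(s) for s in CLEAN_SNIPPETS]
    nearest = min(
        targets, key=lambda t: sum((a - b) ** 2 for a, b in zip(latent, t))
    )
    return [a + strength * (b - a) for a, b in zip(latent, nearest)]


def repair(broken, noise_scale=1.0, steps=10, seed=0):
    """Add Gaussian noise to the broken snippet's latent, then resume denoising."""
    rng = random.Random(seed)
    latent = [v + rng.gauss(0.0, noise_scale) for v in encode(broken)]
    for _ in range(steps):
        latent = denoise_step(latent)
    return decode(latent)


if __name__ == "__main__":
    # The broken snippet has a mismatched closing bracket.
    print(repair("print(x]"))
```

In this sketch the amount of injected noise (`noise_scale`) and the number of resumed steps (`steps`) are illustrative knobs; how much noise to add in a real model is a design choice the abstract does not specify.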

Problem

Research questions and friction points this paper is trying to address.

Leveraging code diffusion models for last-mile repair

Generating training data for efficient last-mile repair tasks

Evaluating applications across Python, Excel, and PowerShell domains

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates code by iteratively removing noise from a latent representation

Leverages diffusion for last-mile repair

Generates training data via diffusion sampling
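As a rough illustration of generating training data via diffusion sampling, the sketch below runs a toy reverse process from noise and records the decoded snippet at a late intermediate step (the repair input) and at the final step (the repair output). Everything here is an assumption made for illustration: `decode` and `sample_repair_pair` are hypothetical names, and the "denoiser" is a fixed pull toward a known snippet rather than a learned model.

```python
import random


def decode(latent):
    """Round each latent value to the nearest printable ASCII character."""
    return "".join(chr(max(32, min(126, round(v)))) for v in latent)


def sample_repair_pair(target, total_steps=12, pair_step=6, seed=0):
    """Run a toy reverse diffusion and return (intermediate, final) snippets.

    `target` plays the role of the program the toy process converges to; a real
    diffusion model would produce it by sampling, not take it as an argument.
    """
    rng = random.Random(seed)
    clean = [float(ord(c)) for c in target]
    # Start from noise spread over the printable ASCII range.
    latent = [rng.uniform(32.0, 126.0) for _ in clean]
    intermediate = None
    for step in range(total_steps):
        # Toy denoising step: halve the remaining distance to the clean latent.
        latent = [a + 0.5 * (b - a) for a, b in zip(latent, clean)]
        if step == pair_step:
            # Late in the process the decoded snippet is nearly converged, so
            # it can serve as the "broken" input of a repair example.
            intermediate = decode(latent)
    return intermediate, decode(latent)


if __name__ == "__main__":
    # Different seeds give different intermediate snippets for the same output.
    for seed in range(3):
        print(sample_repair_pair("print(x)", seed=seed))
```

Varying `seed` and `pair_step` in this toy yields many (input, output) pairs from one process, which is the sense in which the amount of training data is arbitrary.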
