Directional Diffusion-Style Code Editing Pre-training

πŸ“… 2025-01-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing pre-trained code models struggle to capture realistic, incremental code modification processes, limiting their effectiveness on code evolution and editing tasks. To address this, we propose DivoT5β€”the first code pre-training framework incorporating a **directed diffusion mechanism**. It introduces two novel pre-training objectives: **bidirectional evolutionary denoising** and **intermediate-state directed evolution**, explicitly modeling code evolution paths. Built upon the CodeT5 architecture, DivoT5 jointly integrates masked denoising, directed diffusion modeling, intermediate version generation, and evolution path reinforcement. Experiments demonstrate that DivoT5 (220M) achieves state-of-the-art performance on mainstream code editing benchmarks among models of comparable size. Remarkably, under few-shot settings, it surpasses significantly larger 6.7B- and 8B-parameter models. Moreover, even with only 60M parameters, DivoT5 outperforms the 220M CodeT5-base on automated code review tasks.

πŸ“ Abstract
Code pre-trained models have shown promising effectiveness in various software engineering tasks, many of which involve software evolution and/or code editing. However, existing code pre-trained models often overlook real-world code editing data and the evolutionary nature of the editing process. In this paper, to simulate the step-by-step code editing process of human developers, we propose DivoT5, a pre-trained model based on directional diffusion at the data level. DivoT5 adopts two categories of pre-training tasks. The first category comprises mask-and-denoising tasks augmented with a diffusion direction representing code evolution: we apply a noising process to code snippets before evolution, then ask the pre-training process to restore the noised snippets into the corresponding code snippets after evolution. The second category comprises tasks that reinforce the evolutionary direction: for each pair of snippets before and after evolution, we generate various intermediate versions, then ask the pre-training process to transform each intermediate version into the snippet after evolution. We evaluate DivoT5 on two code-editing scenarios and one non-editing scenario using five downstream tasks, fine-tuning the pre-trained DivoT5 on each. Our experimental results show that DivoT5 achieves state-of-the-art (SOTA) performance on most tasks compared with models of the same scale (220M) and larger models (770M) under fine-tuning, and with billion-scale models (6.7B, 8B, ChatGPT) under few-shot settings. For one code-editing task (i.e., automated code review), DivoT5 pre-trained on top of CodeT5-small (60M) even outperforms CodeT5-base (220M) and all other 220M-parameter pre-trained models except DivoT5 pre-trained on top of CodeT5-base (220M).
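The two data constructions described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's actual pipeline: it assumes whitespace tokenization, a `<mask>` noise token, and token-level edits derived with Python's `difflib`; all names (`noised_source`, `intermediate_version`, `keep_rate`) are hypothetical.

```python
import difflib
import random

MASK = "<mask>"

def noised_source(before_tokens, mask_rate=0.15, rng=random):
    # Category 1 (sketch): corrupt the pre-evolution code with mask noise.
    # The training target is the post-evolution code, so denoising also
    # carries the evolution direction.
    return [MASK if rng.random() < mask_rate else t for t in before_tokens]

def intermediate_version(before_tokens, after_tokens, keep_rate=0.5, rng=random):
    # Category 2 (sketch): apply only a random subset of the before->after
    # edits, yielding a partially evolved snippet. The training target is
    # again the post-evolution code, reinforcing the evolution direction.
    sm = difflib.SequenceMatcher(a=before_tokens, b=after_tokens)
    out = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            out.extend(before_tokens[i1:i2])
        elif rng.random() < keep_rate:
            out.extend(after_tokens[j1:j2])   # apply this edit
        else:
            out.extend(before_tokens[i1:i2])  # leave this edit unapplied
    return out

def build_pairs(before, after):
    # One (noised source, target) pair per task category; the target is
    # always the snippet after evolution.
    before_t, after_t = before.split(), after.split()
    return [
        (" ".join(noised_source(before_t)), after),
        (" ".join(intermediate_version(before_t, after_t)), after),
    ]
```

In both categories the target is the post-evolution snippet, so every pre-training example points the model along the evolution direction rather than toward mere reconstruction.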
Problem

Research questions and friction points this paper aims to address:

- Pre-trained code models
- Real-world code modification
- Adaptability limitations
Innovation

Methods, ideas, or system contributions that make the work stand out:

- DivoT5
- Pre-trained model
- Code refinement
Authors

- Qingyuan Liang, Peking University (Software Engineering, Code Generation)
- Zeyu Sun, National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China
- Qihao Zhu, Peking University (Software Engineering)
- Junhao Hu, Key Lab of HCST (PKU), MOE; SCS, Peking University, Beijing 100871, China
- Yifan Zhao, Key Lab of HCST (PKU), MOE; SCS, Peking University, Beijing 100871, China
- Yizhou Chen, Peking University (AI4SE, Vulnerability Detection, Formal Verification)
- Mingxuan Zhu, Peking University
- Guoqing Wang, Key Lab of HCST (PKU), MOE; SCS, Peking University, Beijing 100871, China
- Lu Zhang, Key Lab of HCST (PKU), MOE; SCS, Peking University, Beijing 100871, China