CodeDPO: Aligning Code Models with Self Generated and Verified Source Code

📅 2024-10-08

🏛️ arXiv.org

📈 Citations: 16

✨ Influential: 0

🤖 AI Summary

Existing code generation models struggle to ensure both correctness and runtime efficiency, particularly when several plausible solutions compete in ambiguous situations. This paper proposes CodeDPO, a preference learning framework that needs no human annotation. First, it builds code preference data through a self-generation and self-validation mechanism in which the model produces both candidate solutions and test cases, and each test case is checked against multiple candidates so that validation is more reliable. Second, a PageRank-inspired algorithm iteratively updates the ranking score of each code snippet, and the resulting scores are used to build a preference dataset covering correctness and efficiency. Finally, large language models are fine-tuned on that preference data. The method is test-driven and depends on no external resources or manual annotation. On five widely used benchmarks, the authors report significant improvements in code correctness and execution efficiency over existing methods, and they present the approach as a foundation for code preference optimization in more complex real-world scenarios.

📝 Abstract

Code generation models have shown significant potential for programming tasks. However, existing training methods like supervised fine-tuning face key limitations: they do not effectively teach models to prioritize correct over incorrect solutions in ambiguous situations, nor do they effectively optimize the runtime efficiency of the generated code. To address these challenges, we propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency. CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases. The underlying assumption is that test cases executable by multiple code snippets provide more reliable validation, and code that passes more tests is more likely to be correct. Through this self-validation process, our PageRank-inspired algorithm iteratively updates the ranking score of each code snippet, ultimately creating a code preference optimization dataset based on correctness and efficiency. CodeDPO is flexible and scalable, generating diverse preference optimization data without depending on external resources. Through comprehensive evaluations on five widely used benchmarks, CodeDPO demonstrates significant improvements in correctness and efficiency compared to existing methods. Our experiments show that CodeDPO enhances the capabilities of LLMs in code generation and provides a robust foundation for conducting code preference optimization in more complex and challenging real-world scenarios.
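
The mutual-validation idea in the abstract (tests passed by many snippets are more trustworthy, and snippets passing trustworthy tests are more likely correct) can be illustrated with a small sketch. This is not the paper's implementation: the update rule, the damping value of 0.85, the per-round normalization, and the names `rank_by_mutual_validation` and `_normalize` are all assumptions chosen for illustration.

```python
def _normalize(values):
    """Scale scores to sum to 1 so they stay bounded across rounds (assumption)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else values


def rank_by_mutual_validation(passes, damping=0.85, rounds=10):
    """Hypothetical PageRank-style scoring of code snippets and test cases.

    passes[i][j] is True when snippet i passes test j.
    Returns (code_scores, test_scores).
    """
    n_code = len(passes)
    n_test = len(passes[0]) if passes else 0
    code = [1.0] * n_code
    test = [1.0] * n_test
    for _ in range(rounds):
        # A snippet gains score from every test it passes, weighted by that test's score.
        new_code = [
            (1 - damping) * code[i]
            + damping * sum(test[j] for j in range(n_test) if passes[i][j])
            for i in range(n_code)
        ]
        # A test gains score from every snippet that passes it, weighted by that snippet's score.
        new_test = [
            (1 - damping) * test[j]
            + damping * sum(code[i] for i in range(n_code) if passes[i][j])
            for j in range(n_test)
        ]
        code, test = _normalize(new_code), _normalize(new_test)
    return code, test


# Toy pass matrix: snippet 0 passes every test, snippet 1 passes tests 0 and 1,
# snippet 2 passes only test 2.
example_passes = [
    [True, True, True],
    [True, True, False],
    [False, False, True],
]
code_scores, test_scores = rank_by_mutual_validation(example_passes)
```

In this toy case the snippet that passes every test ends up with the highest score, and the test passed only by the weakest snippet ends up with the lowest test score, which matches the stated assumption.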

Problem

Research questions and friction points this paper is trying to address.

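Improving both the correctness and the runtime efficiency of generated code

Addressing the limitations of supervised fine-tuning, which does not optimize for runtime efficiency

Teaching models to prioritize correct over incorrect solutions in ambiguous situations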

Innovation

Methods, ideas, or system contributions that make the work stand out.

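Integrates preference learning into code generation to improve correctness and efficiency

Uses a self-generation-and-validation mechanism to build the dataset from model-generated code and test cases

Employs a PageRank-inspired algorithm that iteratively updates each code snippet's ranking score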
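
As a rough illustration of how ranking scores and runtimes could be turned into preference data, the sketch below builds (chosen, rejected) pairs for one prompt. The page does not describe the paper's actual pairing rules, so the `Candidate` fields, the `build_pairs` function, and the `min_score_gap` and `min_speedup` thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    code: str
    score: float       # ranking score from the validation step
    runtime_s: float   # measured runtime in seconds
    passes_all: bool   # whether the snippet passes every trusted test


def build_pairs(prompt, candidates, min_score_gap=0.1, min_speedup=1.5):
    """Hypothetical construction of correctness and efficiency preference pairs."""
    pairs = []

    # Correctness pair: highest-scored snippet preferred over lowest-scored one,
    # but only when the score gap is large enough to be meaningful.
    if len(candidates) >= 2:
        ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
        best, worst = ranked[0], ranked[-1]
        if best.score - worst.score >= min_score_gap:
            pairs.append({"prompt": prompt, "chosen": best.code,
                          "rejected": worst.code, "kind": "correctness"})

    # Efficiency pair: among snippets that pass all tests, prefer the fastest
    # over the slowest when the speedup is clear.
    passing = [c for c in candidates if c.passes_all]
    if len(passing) >= 2:
        fast = min(passing, key=lambda c: c.runtime_s)
        slow = max(passing, key=lambda c: c.runtime_s)
        if fast.runtime_s > 0 and slow.runtime_s / fast.runtime_s >= min_speedup:
            pairs.append({"prompt": prompt, "chosen": fast.code,
                          "rejected": slow.code, "kind": "efficiency"})
    return pairs


example_pairs = build_pairs(
    "Sum a list of integers.",
    [
        Candidate("def f(xs): return sum(xs)", 0.50, 0.01, True),
        Candidate("def f(xs):\n    t = 0\n    for x in xs: t += x\n    return t", 0.45, 0.05, True),
        Candidate("def f(xs): return xs[0]", 0.05, 0.02, False),
    ],
)
```

Records in this prompt/chosen/rejected shape are what preference-optimization trainers typically consume; the thresholds guard against pairs whose difference is within noise.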
