Harnessing Large Language Models for Curated Code Reviews

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code review comment generation methods suffer from poor generalizability due to high noise and low quality in publicly available datasets. Method: This paper proposes an LLM-driven data governance pipeline, introducing, for the first time, a collaborative data curation paradigm guided by a multi-dimensional evaluation framework. We systematically clean and structurally enhance the largest public code review dataset, incorporating LLM-assisted denoising, multi-criteria human validation, and joint fine-tuning of structured comment generation and code refinement. Contribution/Results: Experimental results demonstrate significant improvements: comment clarity and conciseness are enhanced; comment generation accuracy increases by 23.6%; code refinement repair success rate rises by 19.4%; and downstream task generalization capability is substantially strengthened.
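The curation flow the summary describes (LLM-assisted denoising, then a validation gate before fine-tuning data is assembled) can be sketched roughly as below. All names here are hypothetical illustrations, not the paper's actual code; the stub denoiser stands in for a real LLM call:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class ReviewSample:
    """One dataset entry: a code change plus its review comment."""
    code_diff: str
    comment: str

def curate(
    samples: List[ReviewSample],
    denoise: Callable[[ReviewSample], Optional[ReviewSample]],
    validate: Callable[[ReviewSample], bool],
) -> List[ReviewSample]:
    """Apply a denoising step, then keep only samples that pass
    a (human or automated) validation check."""
    curated = []
    for sample in samples:
        cleaned = denoise(sample)  # e.g. an LLM rewrites or rejects noisy comments
        if cleaned is not None and validate(cleaned):
            curated.append(cleaned)
    return curated

# Stub denoiser standing in for an LLM call: drop empty comments,
# normalize whitespace on the rest.
def stub_denoise(sample: ReviewSample) -> Optional[ReviewSample]:
    text = sample.comment.strip()
    return ReviewSample(sample.code_diff, text) if text else None

samples = [
    ReviewSample("- x = 1\n+ x = 2", "Use a named constant instead of a magic number."),
    ReviewSample("- y = 0\n+ y = 1", "   "),  # noise: effectively empty comment
]
kept = curate(samples, stub_denoise, validate=lambda s: len(s.comment) > 10)
print(len(kept))  # 1
```

The two-stage shape (transform, then filter) mirrors the summary's separation of LLM denoising from multi-criteria validation.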

📝 Abstract
In code review, generating structured and relevant comments is crucial for identifying code issues and facilitating accurate code changes that ensure an efficient code review process. Well-crafted comments not only streamline the code review itself but are also essential for subsequent tasks like code refinement, where the code is modified to satisfy the input review comment. Although various AI-based approaches aimed to automate comment generation, their effectiveness remains limited by the quality of the training data. Existing code review datasets are often noisy and unrefined, posing limitations to the learning potential of AI models and hindering the automation process. To address these challenges, we propose a curation pipeline designed to enhance the quality of the largest publicly available code review dataset. We begin by establishing an evaluation framework, incorporating specific criteria and categories to empirically study the initial quality of the dataset. Using a large language model (LLM)-driven approach, we then apply our curation pipeline to refine the dataset. A comparative analysis of the newly curated dataset, based on the same evaluation framework, demonstrates substantial improvements in the clarity and conciseness of the comments. Additionally, we assess the impact of the curated dataset on automating downstream tasks, specifically comment generation and code refinement. Our findings show that the curated dataset leads to enhanced model performance in generating more accurate comments. Curated comments are also more useful as they lead to more accurate code refinement.
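As an illustration of the multi-criteria evaluation framework the abstract mentions, each comment could be scored along several dimensions and retained only if every dimension clears a threshold. The criteria names, scale, and toy scorer below are assumptions for illustration; in the paper's setup the judging would come from the LLM-driven framework, not a word-count heuristic:

```python
from typing import Callable, Dict

# Hypothetical evaluation dimensions; the paper's actual criteria may differ.
CRITERIA = ("clarity", "conciseness", "relevance")

def evaluate_comment(comment: str,
                     score: Callable[[str, str], float]) -> Dict[str, float]:
    """Score one review comment on each criterion (here, a 1-5 scale)."""
    return {criterion: score(comment, criterion) for criterion in CRITERIA}

def passes(scores: Dict[str, float], threshold: float = 3.0) -> bool:
    """Keep the comment only if every dimension clears the threshold."""
    return all(value >= threshold for value in scores.values())

# Toy heuristic scorer standing in for an LLM judge.
def toy_score(comment: str, criterion: str) -> float:
    words = comment.split()
    if criterion == "conciseness":
        return 5.0 if len(words) <= 30 else 2.0
    return 5.0 if len(words) >= 4 else 1.0  # very short comments score low

scores = evaluate_comment("Please extract this block into a helper function.", toy_score)
print(passes(scores))  # True
```

Requiring every criterion to pass, rather than averaging, matches the abstract's emphasis on both clarity and conciseness improving in the curated data.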
Problem

Research questions and friction points this paper is trying to address.

Improving code review comment quality
Refining noisy code review datasets
Enhancing AI-driven comment generation accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Curation Pipeline
Enhanced Dataset Quality
O. Sghaier
Université de Montréal, Montréal, Canada

Martin Weyssow
Research Scientist, Singapore Management University
Deep Learning for Code · Large Language Models · AI4SE

H. Sahraoui
Université de Montréal, Montréal, Canada