Too Noisy To Learn: Enhancing Data Quality for Code Review C

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Semantic noise—such as ambiguity and non-actionability—is pervasive in code review datasets, yet existing heuristic- and supervised-learning-based cleaning approaches fail to adequately identify it, thereby limiting the quality of automated review comment generation. This paper introduces large language models (LLMs) to the code review data cleaning task for the first time, proposing a fine-grained noise detection method grounded in prompt engineering and empirical evaluation, which overcomes the semantic understanding limitations of conventional techniques. Experiments demonstrate that our method achieves cleaning precision of 66–85%. When fine-tuned on the cleaned data, comment generation models yield outputs with 12.4–13.0% higher similarity to human-written comments, alongside significant improvements in informativeness and relevance. This work establishes a novel paradigm for constructing high-quality review datasets and enabling controllable, semantically grounded comment generation.
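The detection step described above could be sketched as a prompt-and-parse loop over (code change, comment) pairs. This is a minimal illustration only: the prompt wording, the `llm` callable, and the VALID/NOISY label scheme are assumptions for this sketch, not the paper's actual prompts or model.

```python
# Hypothetical sketch of LLM-based noise detection for review comments.
# The prompt text and the llm callable are illustrative assumptions;
# the paper's actual prompt engineering is not reproduced here.

PROMPT_TEMPLATE = """You are judging the quality of a code review comment.
Code change:
{diff}

Review comment:
{comment}

Is the comment specific and actionable for this change? Answer VALID or NOISY."""

def classify_comment(diff: str, comment: str, llm) -> str:
    """Ask an LLM whether a review comment is valid or semantic noise."""
    prompt = PROMPT_TEMPLATE.format(diff=diff, comment=comment)
    answer = llm(prompt).strip().upper()
    return "valid" if answer.startswith("VALID") else "noisy"

# Toy stand-in for a real LLM call, so the sketch runs end to end.
def toy_llm(prompt: str) -> str:
    return "NOISY" if "looks good" in prompt.lower() else "VALID"

print(classify_comment("+ x = None", "Looks good to me!", toy_llm))  # noisy
print(classify_comment("+ x = None",
                       "Initialize x to 0, not None, to avoid a TypeError.",
                       toy_llm))  # valid
```

In practice the `toy_llm` stub would be replaced by a real model call, and the textual verdict would be parsed more defensively than the prefix check shown here.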

📝 Abstract
Code review is an important practice in software development, yet it is time-consuming and requires substantial effort. While open-source datasets have been used to train neural models for automating code review tasks, including review comment generation, these datasets contain a significant amount of noisy comments (e.g., vague or non-actionable feedback) that persist despite cleaning methods based on heuristics and machine learning. Such remaining noise may lead models to generate low-quality review comments, yet removing it requires a complex semantic understanding of both code changes and natural language comments. In this paper, we investigate the impact of such noise on review comment generation and propose a novel approach using large language models (LLMs) to further clean these datasets. Based on an empirical study on a large-scale code review dataset, our LLM-based approach achieves 66–85% precision in detecting valid comments. Using the predicted valid comments to fine-tune state-of-the-art code review models (cleaned models) produces review comments that are 12.4–13.0% more similar to valid human-written comments than those of the original models. We also find that the cleaned models generate more informative and relevant comments than the original models. Our findings underscore the critical impact of dataset quality on the performance of review comment generation. We advocate for further research into cleaning training data to enhance the practical utility and quality of automated code review.
Problem

Research questions and friction points this paper is trying to address.

Improving code review dataset quality
Reducing noise in review comments
Enhancing automated review comment generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based noise detection
Fine-tuning with valid comments
Evidence that dataset quality drives generation performance
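The "fine-tuning with valid comments" contribution amounts to filtering the training set by the detector's predicted labels before fine-tuning. A minimal sketch of that filtering step, where the record fields (`comment`, `label`) are assumptions for illustration rather than the paper's actual data schema:

```python
# Minimal sketch: keep only comments the noise detector judged valid,
# producing the cleaned set used for fine-tuning. Field names are
# illustrative assumptions, not the paper's dataset schema.

def clean_dataset(records):
    """Filter a review-comment dataset down to predicted-valid examples."""
    return [r for r in records if r.get("label") == "valid"]

raw = [
    {"comment": "LGTM", "label": "noisy"},
    {"comment": "Rename `tmp` to `retry_count` for clarity.", "label": "valid"},
]
cleaned = clean_dataset(raw)
print(len(cleaned))  # 1
```

The cleaned list would then be handed to a standard fine-tuning pipeline in place of the raw dataset.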