Too Noisy To Learn: Enhancing Data Quality for Code Review C

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Semantic noise—such as ambiguity and non-actionability—is pervasive in code review datasets, yet existing heuristic- and supervised-learning-based cleaning approaches fail to adequately identify it, thereby limiting the quality of automated review comment generation. This paper introduces large language models (LLMs) to the code review data cleaning task for the first time, proposing a fine-grained noise detection method grounded in prompt engineering and empirical evaluation, which overcomes the semantic understanding limitations of conventional techniques. Experiments demonstrate that our method achieves cleaning precision of 66–85%. When fine-tuned on the cleaned data, comment generation models yield outputs with 12.4–13.0% higher similarity to human-written comments, alongside significant improvements in informativeness and relevance. This work establishes a novel paradigm for constructing high-quality review datasets and enabling controllable, semantically grounded comment generation.
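The detection step described above could be sketched as a prompt-and-parse loop over (code change, comment) pairs. This is a minimal illustration only: the prompt wording, the `llm` callable, and the VALID/NOISY label scheme are assumptions for this sketch, not the paper's actual prompts or model.

```python
# Hypothetical sketch of LLM-based noise detection for review comments.
# The prompt text and the llm callable are illustrative assumptions;
# the paper's actual prompt engineering is not reproduced here.

PROMPT_TEMPLATE = """You are judging the quality of a code review comment.
Code change:
{diff}

Review comment:
{comment}

Is the comment specific and actionable for this change? Answer VALID or NOISY."""

def classify_comment(diff: str, comment: str, llm) -> str:
    """Ask an LLM whether a review comment is valid or semantic noise."""
    prompt = PROMPT_TEMPLATE.format(diff=diff, comment=comment)
    answer = llm(prompt).strip().upper()
    return "valid" if answer.startswith("VALID") else "noisy"

# Toy stand-in for a real LLM call, so the sketch runs end to end.
def toy_llm(prompt: str) -> str:
    return "NOISY" if "looks good" in prompt.lower() else "VALID"

print(classify_comment("+ x = None", "Looks good to me!", toy_llm))  # noisy
print(classify_comment("+ x = None",
                       "Initialize x to 0, not None, to avoid a TypeError.",
                       toy_llm))  # valid
```

In practice the `toy_llm` stub would be replaced by a real model call, and the textual verdict would be parsed more defensively than the prefix check shown here.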

📝 Abstract
Code review is an important practice in software development, yet it is time-consuming and requires substantial effort. While open-source datasets have been used to train neural models for automating code review tasks, including review comment generation, these datasets contain a significant amount of noisy comments (e.g., vague or non-actionable feedback) that persist despite cleaning methods based on heuristics and machine learning. Such remaining noise may lead models to generate low-quality review comments, yet removing it requires a complex semantic understanding of both code changes and natural language comments. In this paper, we investigate the impact of such noise on review comment generation and propose a novel approach using large language models (LLMs) to further clean these datasets. Based on an empirical study on a large-scale code review dataset, our LLM-based approach achieves 66–85% precision in detecting valid comments. Using the predicted valid comments to fine-tune state-of-the-art code review models (cleaned models) produces review comments that are 12.4–13.0% more similar to valid human-written comments than those of the original models. We also find that the cleaned models generate more informative and relevant comments than the original models. Our findings underscore the critical impact of dataset quality on the performance of review comment generation. We advocate for further research into cleaning training data to enhance the practical utility and quality of automated code review.
Problem

Research questions and friction points this paper is trying to address.

Improving code review dataset quality
Reducing noise in review comments
Enhancing automated review comment generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based noise detection
Fine-tuning with valid comments
Evidence that dataset quality drives generation performance
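The "fine-tuning with valid comments" contribution amounts to filtering the training set by the detector's predicted labels before fine-tuning. A minimal sketch of that filtering step, where the record fields (`comment`, `label`) are assumptions for illustration rather than the paper's actual data schema:

```python
# Minimal sketch: keep only comments the noise detector judged valid,
# producing the cleaned set used for fine-tuning. Field names are
# illustrative assumptions, not the paper's dataset schema.

def clean_dataset(records):
    """Filter a review-comment dataset down to predicted-valid examples."""
    return [r for r in records if r.get("label") == "valid"]

raw = [
    {"comment": "LGTM", "label": "noisy"},
    {"comment": "Rename `tmp` to `retry_count` for clarity.", "label": "valid"},
]
cleaned = clean_dataset(raw)
print(len(cleaned))  # 1
```

The cleaned list would then be handed to a standard fine-tuning pipeline in place of the raw dataset.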