Studying Quality Improvements Recommended via Manual and Automated Code Review

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study presents the first systematic comparison between large language models (specifically ChatGPT-4) and human reviewers in identifying and recommending improvements for code quality during real-world code reviews. Leveraging 739 human-generated review comments from 240 pull requests, the authors constructed an evaluation benchmark through data mining and manual categorization, then prompted ChatGPT to perform automated reviews on the same code changes. Results show that ChatGPT produced 2.4 times more suggestions on average than humans but covered only 10% of the issues identified by human reviewers. Notably, approximately 40% of ChatGPT’s additional suggestions were deemed practically valuable. The findings indicate that while large language models cannot yet replace human reviewers, they offer complementary value by surfacing useful supplementary recommendations.

📝 Abstract
Several Deep Learning (DL)-based techniques have been proposed to automate code review. Still, it is unclear to what extent these approaches can recommend quality improvements as a human reviewer would. We study the similarities and differences between code reviews performed by humans and those automatically generated by DL models, using ChatGPT-4 as a representative of the latter. In particular, we run a mining-based study in which we collect and manually inspect 739 comments posted by human reviewers to suggest code changes in 240 PRs. The manual inspection aims to classify the type of quality improvement recommended by human reviewers (e.g., rename variable/constant). Then, we ask ChatGPT to perform a code review on the same PRs and we compare the quality improvements it recommends against those suggested by the human reviewers. We show that while, on average, ChatGPT tends to recommend a higher number of code changes than human reviewers (~2.4x more), it can only spot 10% of the quality issues reported by humans. However, ~40% of the additional comments generated by the LLM point to meaningful quality issues. In short, our findings show the complementarity of manual and AI-based code review. This suggests that, in its current state, DL-based code review can be used as a further quality check on top of the one performed by humans, but should not be considered a valid alternative to them, nor a means to save code review time: human reviewers would still need to perform their manual inspection while also validating the quality issues reported by the DL-based technique.
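The paper's headline numbers (about 2.4x more suggestions, 10% coverage of human-reported issues) reduce to simple volume and overlap comparisons between the two sets of categorized review comments. A minimal sketch of how such metrics could be computed (the category labels below are hypothetical examples, not the paper's actual taxonomy):

```python
# Sketch: comparing categorized review comments from humans vs. an LLM.
# Labels are illustrative; the paper derives its categories via manual inspection.

def review_overlap(human, llm):
    """Return (LLM/human volume ratio,
               share of human-reported issue types the LLM also found,
               LLM-only suggestions with no human counterpart)."""
    human_set, llm_set = set(human), set(llm)
    ratio = len(llm) / len(human) if human else float("inf")
    coverage = len(human_set & llm_set) / len(human_set) if human_set else 0.0
    extra = sorted(llm_set - human_set)
    return ratio, coverage, extra

human_comments = ["rename variable", "extract method"]
llm_comments = ["rename variable", "add null check", "remove dead code",
                "simplify condition", "add logging"]

ratio, coverage, extra = review_overlap(human_comments, llm_comments)
print(f"LLM/human volume: {ratio:.1f}x")            # 2.5x in this toy example
print(f"coverage of human issues: {coverage:.0%}")  # 50% in this toy example
print(f"LLM-only suggestions: {extra}")
```

In the study itself, each LLM-only suggestion was additionally judged for practical value, which is where the ~40% "meaningful" figure comes from.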
Problem

Research questions and friction points this paper is trying to address.

code review
quality improvement
deep learning
automated code review
human-AI collaboration
Innovation

Methods, ideas, or system contributions that make the work stand out.

code review
large language models
software quality
human-AI collaboration
empirical study