CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells

📅 2024-09-29
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of automatically assessing code review quality in the absence of human-written reference comments, this paper proposes CRScore, a reference-free, semantics-driven, fine-grained metric for review quality evaluation. CRScore jointly leverages large language models' code semantic understanding and static analysis-detected code smells to quantify comment quality along three dimensions: conciseness, comprehensiveness, and relevance. Unlike conventional reference-based metrics, CRScore avoids the semantic-alignment bottleneck of comparing against a single reference and achieves the strongest alignment with human judgment among open-source metrics (Spearman ρ = 0.54), while being more sensitive to quality differences. The paper also releases a manually annotated evaluation corpus of 2.9K review quality scores, providing a foundational resource for developing automated code review assessment metrics.

📝 Abstract
The task of automated code review has recently gained a lot of attention from the machine learning community. However, current review comment evaluation metrics rely on comparisons with a human-written reference for a given code change (also called a diff). Furthermore, code review is a one-to-many problem, like generation and summarization, with many "valid reviews" for a diff. Thus, we develop CRScore, a reference-free metric to measure dimensions of review quality like conciseness, comprehensiveness, and relevance. We design CRScore to evaluate reviews in a way that is grounded in claims and potential issues detected in the code by LLMs and static analyzers. We demonstrate that CRScore can produce valid, fine-grained scores of review quality that have the greatest alignment with human judgment among open-source metrics (0.54 Spearman correlation) and are more sensitive than reference-based metrics. We also release a corpus of 2.9k human-annotated review quality scores for machine-generated and GitHub review comments to support the development of automated metrics.
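To make the grounding idea concrete, here is a minimal sketch of how a CRScore-style metric could be computed. All names, the toy similarity function, and the threshold are illustrative assumptions, not the paper's implementation: the only structure taken from the abstract is that review sentences are matched against claims/smells extracted from the diff, and quality is scored along conciseness, comprehensiveness, and relevance.

```python
# Hypothetical sketch of a CRScore-style metric. The paper grounds scoring in
# claims (from LLMs) and smells (from static analyzers); here both are just
# strings, and the matching rule is an assumed similarity threshold.
def crscore_like(review_sentences, claims, similarity, threshold=0.5):
    """Score a review against claims/smells extracted from a code diff."""
    if not review_sentences or not claims:
        return {"conciseness": 0.0, "comprehensiveness": 0.0, "relevance": 0.0}
    # A review sentence is "grounded" if it matches at least one claim.
    grounded = [s for s in review_sentences
                if any(similarity(s, c) >= threshold for c in claims)]
    # A claim is "covered" if at least one review sentence addresses it.
    covered = [c for c in claims
               if any(similarity(s, c) >= threshold for s in review_sentences)]
    conciseness = len(grounded) / len(review_sentences)    # precision-like
    comprehensiveness = len(covered) / len(claims)         # recall-like
    denom = conciseness + comprehensiveness
    relevance = (2 * conciseness * comprehensiveness / denom) if denom else 0.0
    return {"conciseness": conciseness,
            "comprehensiveness": comprehensiveness,
            "relevance": relevance}

def overlap(a, b):
    """Toy word-overlap (Jaccard) similarity standing in for an embedding model."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)
```

Under this sketch, a review that restates every detected claim and nothing else scores 1.0 on all three dimensions, while off-topic sentences lower conciseness and missed claims lower comprehensiveness; relevance combines the two as a harmonic mean.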
Problem

Research questions and friction points this paper is trying to address.

Develops CRScore for automated code review evaluation.
Measures review quality without human-written references.
Aligns with human judgment better than existing metrics.
Innovation

Methods, ideas, or system contributions that make the work stand out.

CRScore: reference-free metric for code review quality
Uses LLMs and static analyzers for code issue detection
Corpus of 2.9k human-annotated review scores released
Atharva Naik
PhD Student, Carnegie Mellon University
LLM4Code, LLM Reasoning, Alignment
Marcus Alenius
Language Technologies Institute, Carnegie Mellon University
Daniel Fried
Carnegie Mellon University
Natural Language Processing, Machine Learning
Carolyn Rose
Language Technologies Institute, Carnegie Mellon University