🤖 AI Summary
To address the challenge of automatically assessing code review quality in the absence of human-written reference comments, this paper proposes CRScore—the first reference-free, semantics-driven, fine-grained metric for review quality evaluation. CRScore jointly leverages large language models' understanding of code semantics and code smells detected by static analyzers to quantify comment quality along three dimensions: conciseness, comprehensiveness, and relevance. Unlike conventional reference-based metrics, CRScore sidesteps the semantic-alignment bottleneck of matching against a single reference and achieves the strongest alignment with human judgment among open-source metrics (Spearman ρ = 0.54), while also proving more sensitive to quality differences. Additionally, the paper releases a manually annotated evaluation corpus of 2.9K review quality scores covering both machine-generated and GitHub review comments, establishing a reference-free paradigm and a foundational resource for code review assessment.
📝 Abstract
The task of automated code review has recently gained a lot of attention from the machine learning community. However, current review comment evaluation metrics rely on comparisons with a human-written reference for a given code change (also called a diff). Furthermore, code review is a one-to-many problem, like generation and summarization, with many "valid reviews" for a diff. Thus, we develop CRScore - a reference-free metric to measure dimensions of review quality like conciseness, comprehensiveness, and relevance. We design CRScore to evaluate reviews in a way that is grounded in claims and potential issues detected in the code by LLMs and static analyzers. We demonstrate that CRScore can produce valid, fine-grained scores of review quality that have the greatest alignment with human judgment among open source metrics (0.54 Spearman correlation) and are more sensitive than reference-based metrics. We also release a corpus of 2.9k human-annotated review quality scores for machine-generated and GitHub review comments to support the development of automated metrics.
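The grounding idea described above can be illustrated with a small sketch. This is *not* the paper's implementation: the function names, the crude lexical matcher, and the precision/recall framing of the three dimensions are all assumptions made for illustration. The intuition is that once LLMs and static analyzers produce a set of "claims" about a diff, conciseness behaves like precision (how many review sentences map to a real claim), comprehensiveness like recall (how many claims the review covers), and relevance combines the two.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Crude lexical similarity; a stand-in for the semantic matching
    # a real metric would do with embeddings or an LLM.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def crscore_sketch(review_sentences, detected_claims, threshold=0.5):
    """Hypothetical sketch of claim-grounded review scoring.

    - conciseness: fraction of review sentences matching some detected
      claim (precision-like)
    - comprehensiveness: fraction of detected claims covered by the
      review (recall-like)
    - relevance: harmonic mean of the two (F1-like)
    """
    matched = sum(
        any(similarity(s, c) >= threshold for c in detected_claims)
        for s in review_sentences
    )
    covered = sum(
        any(similarity(c, s) >= threshold for s in review_sentences)
        for c in detected_claims
    )
    con = matched / len(review_sentences) if review_sentences else 0.0
    comp = covered / len(detected_claims) if detected_claims else 0.0
    rel = 2 * con * comp / (con + comp) if (con + comp) else 0.0
    return {"conciseness": con, "comprehensiveness": comp, "relevance": rel}

# Example: the review flags one of two detected issues.
scores = crscore_sketch(
    ["unused variable x"],
    ["unused variable x", "missing null check"],
)
```

In this example the single review sentence matches a claim (conciseness 1.0) but covers only one of two claims (comprehensiveness 0.5), so relevance lands between the two. The key property, mirrored from the paper's design, is that no human-written reference review appears anywhere: the score is grounded entirely in what was detected in the code.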