CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models

📅 2025-03-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Despite their proficiency in code generation, large language models (LLMs) struggle to comprehend implicit, ambiguous, or colloquial code review comments, limiting their practical deployment in real-world software engineering. To address this gap, we introduce the first fine-grained benchmark for code review understanding, featuring a three-stage decoupled evaluation paradigm: (1) change type recognition, (2) change localisation, and (3) solution identification, implemented via multiple-choice questions to avoid text-matching bias and training-data contamination. The benchmark spans nine programming languages, evaluates 72 mainstream LLMs, and comprises 900 high-quality, human-annotated instances, enabling capability diagnostics across languages and difficulty levels that can be attributed to specific reasoning steps. Empirical analysis reveals substantial weaknesses in change localisation and shows that comprehension performance is disentangled from generative revision quality, validating code review understanding as a distinct, measurable competency.

📝 Abstract
State-of-the-art large language models (LLMs) have demonstrated impressive code generation capabilities but struggle with real-world software engineering tasks, such as revising source code to address code reviews, hindering their practical use. Code review comments are often implicit, ambiguous, and colloquial, requiring models to grasp both code and human intent. This challenge calls for evaluating large language models' ability to bridge both technical and conversational contexts. While existing work has employed the automated code refinement (ACR) task to resolve these comments, current evaluation methods fall short, relying on text matching metrics that provide limited insight into model failures and remain susceptible to training data contamination. To address these limitations, we introduce a novel evaluation benchmark, CodeReviewQA, that enables us to conduct fine-grained assessment of model capabilities and mitigate data contamination risks. In CodeReviewQA, we decompose the generation task of code refinement into three essential reasoning steps: change type recognition (CTR), change localisation (CL), and solution identification (SI). Each step is reformulated as multiple-choice questions with varied difficulty levels, enabling precise assessment of model capabilities, while mitigating data contamination risks. Our comprehensive evaluation spans 72 recently released large language models on 900 manually curated, high-quality examples across nine programming languages. Our results show that CodeReviewQA is able to expose specific model weaknesses in code review comprehension, disentangled from their generative automated code refinement results.
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs' ability to handle code review comments effectively.
Assess models' comprehension of technical and conversational contexts in code reviews.
Mitigate data contamination risks in evaluating code refinement tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CodeReviewQA for fine-grained model assessment.
Decomposes code refinement into three reasoning steps (CTR, CL, SI).
Uses multiple-choice questions to mitigate data contamination.
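To make the three-stage multiple-choice design concrete, here is a minimal sketch of how such a benchmark instance and per-stage accuracy scoring could look. The field names (`stage`, `choices`, `answer`) and the sample questions are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    # Hypothetical instance schema; "CTR" = change type recognition,
    # "CL" = change localisation, "SI" = solution identification.
    stage: str
    question: str
    choices: list[str]
    answer: int  # index of the correct choice

def per_stage_accuracy(items, predict):
    """Score a model (predict: MCQItem -> choice index) per reasoning stage."""
    counts = {}  # stage -> (hits, total)
    for item in items:
        hits, total = counts.get(item.stage, (0, 0))
        counts[item.stage] = (hits + (predict(item) == item.answer), total + 1)
    return {stage: hits / total for stage, (hits, total) in counts.items()}

# Toy examples, one per stage.
items = [
    MCQItem("CTR", "What kind of change does the review request?",
            ["refactor", "bug fix", "doc update", "style"], 1),
    MCQItem("CL", "Which lines must change to address the comment?",
            ["L3-L5", "L10", "L12-L14", "none"], 0),
    MCQItem("SI", "Which candidate revision resolves the comment?",
            ["A", "B", "C", "D"], 2),
]

# A degenerate "model" that always picks the first choice.
print(per_stage_accuracy(items, lambda item: 0))
# → {'CTR': 0.0, 'CL': 1.0, 'SI': 0.0}
```

Scoring each stage separately is what makes failures attributable: a model can score well on change type recognition while failing change localisation, which aggregate text-matching metrics would hide.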