Understanding the Limits of Automated Evaluation for Code Review Bots in Practice

📅 2026-04-27

🤖 AI Summary
This study addresses the challenge that developer-provided usefulness labels for LLM-generated code review comments in industrial settings are shaped by workflow pressures and organizational factors, which undermines their reliability as ground truth. Using 2,604 LLM-generated comments and the corresponding engineer annotations from Beko, the work evaluates, in a real-world industrial context, how well human feedback aligns with two automated evaluation paradigms: G-Eval and LLM-as-a-Judge. The experiments span multiple models (Gemini-2.5-pro, GPT-4.1-mini, and GPT-5.2) and are complemented by a follow-up interview with a software engineering director. Results show only moderate agreement (approximately 0.44 to 0.62) between automated judgments and human labels, with outcomes sensitive to both model choice and evaluation design (binary versus 0-4 Likert scale). This exposes the limits of relying solely on automated assessment and challenges the common assumption that developer annotations constitute a gold standard.

📝 Abstract
Automated code review (ACR) bots are increasingly used in industrial software development to assist developers during pull request (PR) review. As adoption grows, a key challenge is how to evaluate the usefulness of bot-generated comments reliably and at scale. In practice, such evaluation often relies on developer actions and annotations that are shaped by contextual and organizational factors, complicating their use as objective ground truth. We examine the feasibility and limitations of automating the evaluation of LLM-powered ACR bots in an industrial setting. We analyze an industrial dataset from Beko comprising 2,604 bot-generated PR comments, each labeled by software engineers as fixed/wontFix. Two automated evaluation approaches, G-Eval and an LLM-as-a-Judge pipeline, are applied using both binary decisions and a 0-4 Likert-scale formulation, enabling a controlled comparison against developer-provided labels. Across Gemini-2.5-pro, GPT-4.1-mini, and GPT-5.2, both evaluation strategies achieve only moderate alignment with human labels. Agreement ratios range from approximately 0.44 to 0.62, with noticeable variation across models and between binary and Likert-scale formulations, indicating sensitivity to both model choice and evaluation design. Our findings highlight practical limitations in fully automating the evaluation of ACR bot comments in industrial contexts. Developer actions such as resolving or ignoring comments reflect not only comment quality, but also contextual constraints, prioritization decisions, and workflow dynamics that are difficult to capture through static artifacts. Insights from a follow-up interview with a software engineering director further corroborate that developer labeling behavior is strongly influenced by workflow pressures and organizational constraints, reinforcing the challenges of treating such signals as objective ground truth.
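
To make the comparison in the abstract concrete, the sketch below shows one way an agreement ratio between developer labels and automated judgments could be computed. The abstract does not spell out the exact computation, so this assumes the ratio is simply the fraction of comments where the two labels match. The function names, the Likert cut-off `threshold=2`, and the five toy records are all hypothetical and are not taken from the Beko dataset.

```python
from typing import Sequence


def likert_to_binary(score: int, threshold: int = 2) -> str:
    """Map a 0-4 Likert score to a fixed/wontFix label.

    The cut-off is an assumption made for illustration; the paper's
    actual mapping (if any) is not given in the abstract.
    """
    if not 0 <= score <= 4:
        raise ValueError(f"Likert score out of range: {score}")
    return "fixed" if score >= threshold else "wontFix"


def agreement_ratio(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of comments where the judge label equals the human label."""
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")
    if not human:
        raise ValueError("cannot compute agreement on an empty sequence")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


# Toy data, invented for this example only.
human_labels = ["fixed", "wontFix", "fixed", "wontFix", "fixed"]
binary_judge = ["fixed", "fixed", "fixed", "wontFix", "wontFix"]
likert_judge = [3, 1, 4, 0, 1]

likert_as_binary = [likert_to_binary(s) for s in likert_judge]

print(agreement_ratio(human_labels, binary_judge))      # 3 of 5 match
print(agreement_ratio(human_labels, likert_as_binary))  # 4 of 5 match
```

Even on toy data, the binary and Likert formulations can disagree with each other for the same comments, which mirrors the sensitivity to evaluation design that the abstract reports. A raw match fraction also ignores class imbalance, so a chance-corrected statistic may be worth reporting alongside it.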
Problem

Research questions and friction points this paper is trying to address.

automated code review
evaluation reliability
developer feedback
LLM evaluation
industrial software development
Innovation

Methods, ideas, or system contributions that make the work stand out.

automated code review
LLM evaluation
developer feedback
industrial dataset
evaluation limitations