CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects

📅 2025-09-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing automated code review (CR) benchmarks suffer from a “reality gap”: they typically focus on isolated subtasks and simplified datasets, lacking repository-level context and end-to-end evaluation capabilities. Method: We introduce CodeFuse-CR-Bench, the first comprehensiveness-aware, repository-level, context-rich, end-to-end CR benchmark, comprising 601 high-quality pull request (PR) instances from 70 Python projects spanning nine representative PR problem domains. Our evaluation framework integrates rule-based syntactic and positional checks with large language model (LLM)-driven quality assessment, and each instance jointly provides multi-source contextual signals (e.g., the associated issue, PR description, and repository state) to enhance evaluation fidelity. Contribution/Results: We conduct the first systematic evaluation of mainstream LLMs on CodeFuse-CR-Bench, establishing critical baselines: no single LLM dominates all aspects of CR; Gemini 2.5 Pro achieves the best overall performance; and models exhibit significant divergence in review comprehensiveness and robustness to redundant context.
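The hybrid evaluation described above, with rule-based gates followed by a model-based quality judgment, can be sketched roughly as follows. This is a minimal illustration, not the paper's actual scoring code: the field names (`line`, `hunk_lines`, `suggestion`) and the gate-then-score structure are assumptions for the sake of the example.

```python
import ast


def syntax_check(code: str) -> bool:
    """Rule-based check: does a suggested code snippet parse as valid Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def location_check(comment_line: int, hunk_lines) -> bool:
    """Rule-based check: does the review comment target a line inside the diff hunk?"""
    return comment_line in hunk_lines


def combined_score(comment: dict, llm_quality_score: float) -> float:
    """Gate a model-based quality judgment behind rule-based checks.

    Comments that point outside the changed hunk, or whose code suggestion
    does not parse, score zero; otherwise the LLM-judge score (0..1) passes through.
    """
    if not location_check(comment["line"], comment["hunk_lines"]):
        return 0.0
    suggestion = comment.get("suggestion")
    if suggestion and not syntax_check(suggestion):
        return 0.0
    return llm_quality_score
```

For example, a well-placed comment with a parseable suggestion keeps its judge score, while one anchored outside the hunk is zeroed out regardless of how fluent the review text is.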

📝 Abstract
Automated code review (CR) is a key application for Large Language Models (LLMs), but progress is hampered by a "reality gap": existing benchmarks evaluate models on isolated sub-tasks using simplified, context-poor data. This fails to reflect the holistic context-rich nature of real-world CR. To bridge this gap, we introduce CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level CR evaluation. CodeFuse-CR-Bench comprises 601 high-quality instances from 70 Python projects covering nine Pull-Request (PR) problem domains, where each instance provides rich, multi-faceted context including the associated issue, PR details, and repository state, enabling end-to-end evaluation. Beyond superficial metrics, we also propose a novel evaluation framework that combines rule-based checks for location and syntax with model-based judgments of review quality. We present the first large-scale assessment of state-of-the-art LLMs on this comprehensive CR task. Our results establish crucial baselines and reveal that (1) no single LLM dominates all aspects of CR; (2) Gemini 2.5 Pro achieves the highest comprehensive performance; and (3) different LLMs exhibit varying robustness to redundant context. These findings highlight the necessity of holistic, multi-dimensional evaluation and provide actionable insights for advancing truly intelligent yet practical CR assistants.
Problem

Research questions and friction points this paper is trying to address.

Bridging the reality gap in automated code review benchmarks
Evaluating LLMs on holistic repository-level code review tasks
Assessing comprehensiveness-aware code review across multiple problem domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repository-level benchmark for code review
Combines rule-based and model-based evaluation
Evaluates LLMs on comprehensive context-rich data