Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code review benchmarks suffer from three key limitations: insufficient semantic context (e.g., issue descriptions), high data noise, and coarse granularity (file- or commit-level only). To address these, we propose ContextCRBench, the first fine-grained benchmark integrating issue descriptions with complete function- and class-level code contexts, supporting line-level defect localization, code block quality assessment, and comment generation. We introduce a "crawl, extract, multi-stage filtering" construction pipeline that combines rule-based and LLM-driven data cleaning and injects developer intent via issue–pull-request alignment. The final benchmark comprises 67,910 high-quality samples. Evaluation across eight mainstream LLMs demonstrates that semantic context significantly improves model performance. When deployed at ByteDance, our benchmark enhanced an industrial self-evolving code review system by 61.98%.

📝 Abstract
Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) have shown promise in automating this process. However, existing benchmarks for LLM-based code review face three major limitations. (1) Lack of semantic context: most benchmarks provide only code diffs without textual information such as issue descriptions, which are crucial for understanding developer intent. (2) Data quality issues: without rigorous validation, many samples are noisy (e.g., reviews on outdated or irrelevant code), reducing evaluation reliability. (3) Coarse granularity: most benchmarks operate at the file or commit level, overlooking the fine-grained, line-level reasoning essential for precise review. We introduce ContextCRBench, a high-quality, context-rich benchmark for fine-grained LLM evaluation in code review. Our construction pipeline comprises: (1) Raw Data Crawling, collecting 153.7K issues and pull requests from top-tier repositories; (2) Comprehensive Context Extraction, linking issue-PR pairs for textual context and extracting the full surrounding function or class for code context; and (3) Multi-stage Data Filtering, combining rule-based and LLM-based validation to remove outdated, malformed, or low-value samples, resulting in 67,910 context-enriched entries. ContextCRBench supports three evaluation scenarios aligned with the review workflow: (1) hunk-level quality assessment, (2) line-level defect localization, and (3) line-level comment generation. Evaluating eight leading LLMs (four closed-source and four open-source) reveals that textual context yields greater performance gains than code context alone, while current LLMs remain far from human-level review ability. Deployed at ByteDance, ContextCRBench drives a self-evolving code review system, improving performance by 61.98% and demonstrating its robustness and industrial utility.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks lack semantic context like issue descriptions for code review
Current benchmarks suffer from data quality issues with noisy samples
Most code review benchmarks operate at coarse granularity levels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enriched benchmark with issue-PR pairs for context
Multi-stage filtering combines rules and LLM validation
Fine-grained evaluation at hunk and line levels
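The multi-stage filtering idea above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the sample fields, rule thresholds, and the keyword heuristic standing in for the LLM-based validator are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReviewSample:
    """One candidate benchmark entry (hypothetical fields for illustration)."""
    diff: str          # code change the review comment refers to
    comment: str       # reviewer's comment text
    is_outdated: bool  # comment left on code later rewritten in the same PR


def rule_based_filter(sample: ReviewSample) -> bool:
    """Stage 1: cheap structural rules reject obviously bad samples."""
    if sample.is_outdated:               # review no longer matches final code
        return False
    if not sample.diff.strip():          # malformed: empty code change
        return False
    if len(sample.comment.split()) < 3:  # low-value: e.g. "LGTM", "+1"
        return False
    return True


def llm_based_filter(sample: ReviewSample) -> bool:
    """Stage 2: judge whether the comment is substantive.

    Stubbed with a keyword check; a real pipeline would call an LLM here.
    """
    low_value_markers = {"lgtm", "thanks", "done", "ship"}
    first_word = sample.comment.split()[0].lower().rstrip(":,.!")
    return first_word not in low_value_markers


def multi_stage_filter(samples):
    """Run the cheap rule stage first, the expensive LLM stage second."""
    return [s for s in samples if rule_based_filter(s) and llm_based_filter(s)]


samples = [
    ReviewSample("+ x = 1", "Consider validating x before assignment.", False),
    ReviewSample("", "Empty diffs should be dropped.", False),
    ReviewSample("+ y = 2", "LGTM", False),
    ReviewSample("+ z = 3", "Rename z for clarity and add a docstring.", True),
]
kept = multi_stage_filter(samples)
print(len(kept))  # only the first sample survives both stages -> 1
```

Ordering the stages cheap-first is the usual design choice: structural rules discard most noise before any model call is spent on the survivors.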
Ruida Hu
Harbin Institute of Technology, Shenzhen
software engineering · LLM agent
Xinchen Wang
Harbin Institute of Technology
AI4SE · Code Intelligence
Xinjie Wen
Harbin Institute of Technology, Shenzhen, China
Zhao Zhang
ByteDance, Beijing, China
Bo Jiang
ByteDance, Beijing, China
Pengfei Gao
ByteDance, Beijing, China
Chao Peng
ByteDance, Beijing, China
Cuiyun Gao
Harbin Institute of Technology, Shenzhen, China