Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code review benchmarks suffer from three key limitations: insufficient semantic context (e.g., issue descriptions), high data noise, and coarse granularity (file- or commit-level only). To address these, we propose ContextCRBench, the first fine-grained benchmark integrating issue descriptions with complete function- and class-level code contexts, supporting line-level defect localization, code block quality assessment, and comment generation. We introduce a "crawl, extract, multi-stage filtering" construction pipeline that combines rule-based and LLM-driven data cleaning and injects developer intent via issue–pull-request alignment. The final benchmark comprises 67,910 high-quality samples. Evaluation across eight mainstream LLMs demonstrates that semantic context significantly improves model performance. When deployed at ByteDance, our benchmark enhanced an industrial self-evolving code review system by 61.98%.

📝 Abstract
Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) have shown promise in automating this process. However, existing benchmarks for LLM-based code review face three major limitations. (1) Lack of semantic context: most benchmarks provide only code diffs without textual information such as issue descriptions, which are crucial for understanding developer intent. (2) Data quality issues: without rigorous validation, many samples are noisy (e.g., reviews on outdated or irrelevant code), reducing evaluation reliability. (3) Coarse granularity: most benchmarks operate at the file or commit level, overlooking the fine-grained, line-level reasoning essential for precise review. We introduce ContextCRBench, a high-quality, context-rich benchmark for fine-grained LLM evaluation in code review. Our construction pipeline comprises: (1) Raw Data Crawling, collecting 153.7K issues and pull requests from top-tier repositories; (2) Comprehensive Context Extraction, linking issue-PR pairs for textual context and extracting the full surrounding function or class for code context; and (3) Multi-stage Data Filtering, combining rule-based and LLM-based validation to remove outdated, malformed, or low-value samples, resulting in 67,910 context-enriched entries. ContextCRBench supports three evaluation scenarios aligned with the review workflow: (1) hunk-level quality assessment, (2) line-level defect localization, and (3) line-level comment generation. Evaluating eight leading LLMs (four closed-source and four open-source) reveals that textual context yields greater performance gains than code context alone, while current LLMs remain far from human-level review ability. Deployed at ByteDance, ContextCRBench drives a self-evolving code review system, improving performance by 61.98% and demonstrating its robustness and industrial utility.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks lack semantic context like issue descriptions for code review
Current benchmarks suffer from data quality issues with noisy samples
Most code review benchmarks operate at coarse granularity levels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enriched benchmark with issue-PR pairs for context
Multi-stage filtering combines rules and LLM validation
Fine-grained evaluation at hunk and line levels
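The multi-stage filtering idea above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the sample fields, rule thresholds, and the keyword heuristic standing in for the LLM-based validator are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReviewSample:
    """One candidate benchmark entry (hypothetical fields for illustration)."""
    diff: str          # code change the review comment refers to
    comment: str       # reviewer's comment text
    is_outdated: bool  # comment left on code later rewritten in the same PR


def rule_based_filter(sample: ReviewSample) -> bool:
    """Stage 1: cheap structural rules reject obviously bad samples."""
    if sample.is_outdated:               # review no longer matches final code
        return False
    if not sample.diff.strip():          # malformed: empty code change
        return False
    if len(sample.comment.split()) < 3:  # low-value: e.g. "LGTM", "+1"
        return False
    return True


def llm_based_filter(sample: ReviewSample) -> bool:
    """Stage 2: judge whether the comment is substantive.

    Stubbed with a keyword check; a real pipeline would call an LLM here.
    """
    low_value_markers = {"lgtm", "thanks", "done", "ship"}
    first_word = sample.comment.split()[0].lower().rstrip(":,.!")
    return first_word not in low_value_markers


def multi_stage_filter(samples):
    """Run the cheap rule stage first, the expensive LLM stage second."""
    return [s for s in samples if rule_based_filter(s) and llm_based_filter(s)]


samples = [
    ReviewSample("+ x = 1", "Consider validating x before assignment.", False),
    ReviewSample("", "Empty diffs should be dropped.", False),
    ReviewSample("+ y = 2", "LGTM", False),
    ReviewSample("+ z = 3", "Rename z for clarity and add a docstring.", True),
]
kept = multi_stage_filter(samples)
print(len(kept))  # only the first sample survives both stages -> 1
```

Ordering the stages cheap-first is the usual design choice: structural rules discard most noise before any model call is spent on the survivors.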
Ruida Hu
Harbin Institute of Technology, Shenzhen
software engineering · LLM agent
Xinchen Wang
Harbin Institute of Technology
AI4SE · Code Intelligence
Xinjie Wen
Harbin Institute of Technology, Shenzhen, China
Zhao Zhang
ByteDance, Beijing, China
Bo Jiang
ByteDance, Beijing, China
Pengfei Gao
ByteDance, Beijing, China
Chao Peng
ByteDance, Beijing, China
Cuiyun Gao
Harbin Institute of Technology, Shenzhen, China