A Survey of Code Review Benchmarks and Evaluation Practices in Pre-LLM and LLM Era

πŸ“… 2026-02-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing code review benchmarks lack systematic organization, resulting in fragmented datasets and inconsistent evaluation criteria that hinder accurate assessment of model capabilities. This work presents a systematic literature review of 99 studies published between 2015 and 2025 and introduces the first multi-level task taxonomy for code review, encompassing five major domains and 18 fine-grained tasks. Through meta-analysis, it contrasts research evolution between the pre-LLM and LLM eras, revealing a paradigm shift from change understanding toward end-to-end generative review. The study identifies critical challenges, such as insufficient multilingual support and the absence of dynamic runtime evaluation, thereby establishing a theoretical foundation and offering clear directions for developing more realistic and comprehensive LLM-based code review benchmarks.

πŸ“ Abstract
Code review is a critical practice in modern software engineering, helping developers detect defects early, improve code quality, and facilitate knowledge sharing. With the rapid advancement of large language models (LLMs), a growing body of work has explored automated support for code review. However, progress in this area is hindered by the lack of a systematic understanding of existing benchmarks and evaluation practices. Current code review datasets are scattered, vary widely in design, and provide limited insight into what review capabilities are actually being assessed. In this paper, we present a comprehensive survey of code review benchmarks spanning both the Pre-LLM and LLM eras (2015--2025). We analyze 99 research papers (58 Pre-LLM era and 41 LLM era) and extract key metadata, including datasets, evaluation metrics, data sources, and target tasks. Based on this analysis, we propose a multi-level taxonomy that organizes code review research into five domains and 18 fine-grained tasks. Our study reveals a clear shift toward end-to-end generative peer review, increasing multilingual coverage, and a decline in standalone change understanding tasks. We further identify limitations of current benchmarks and outline future directions, including broader task coverage, dynamic runtime evaluation, and taxonomy-guided fine-grained assessment. This survey provides a structured foundation for developing more realistic and comprehensive benchmarks for LLM-based code review.
Problem

Research questions and friction points this paper is trying to address.

code review
benchmarks
evaluation practices
large language models
software engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

code review
large language models
benchmark taxonomy
evaluation practices
generative peer review