CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

📅 2025-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code critique benchmarks suffer from task narrowness, incomplete assessment dimensions, and insufficient coverage of difficulty levels. This paper introduces CodeCriticBench, a comprehensive benchmark for evaluating the code critique capabilities of large language models (LLMs). It spans two core tasks—code generation and code question answering—and supports queries at multiple difficulty levels alongside dual-tier (basic and advanced) critique assessment. Its key innovation is a structured, checklist-driven evaluation framework that overcomes the limitation of conventional benchmarks, which judge only raw generated outputs. High-quality test cases are curated from HumanEval, MBPP, and other sources, combined with standardized prompting and rigorous human validation. Empirical evaluation across mainstream LLMs reveals significant capability gaps on complex critique tasks, providing a new quantitative benchmark and empirical foundation for measuring, aligning, and improving code critique proficiency.
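The checklist-driven framework described above can be illustrated with a minimal sketch. The checklist items, weights, and scoring rule below are hypothetical examples, not the paper's actual rubric; the paper's real checklists are fine-grained and task-specific.

```python
# Hypothetical sketch of checklist-driven critique scoring.
# Criteria and weights are illustrative assumptions, not CodeCriticBench's rubric.
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    criterion: str   # e.g. "identifies the off-by-one error"
    weight: float    # relative importance of this criterion
    satisfied: bool  # whether the critique meets it (judged by a human or LLM judge)

def checklist_score(items: list[ChecklistItem]) -> float:
    """Weighted fraction of checklist criteria that the critique satisfies."""
    total = sum(i.weight for i in items)
    if total == 0:
        return 0.0
    return sum(i.weight for i in items if i.satisfied) / total

# Example: scoring an LLM's critique of a buggy sorting function
items = [
    ChecklistItem("identifies the incorrect loop bound", 2.0, True),
    ChecklistItem("explains why the bug causes wrong output", 1.5, True),
    ChecklistItem("proposes a working fix", 1.5, False),
]
print(round(checklist_score(items), 2))  # 0.7
```

Scoring against explicit criteria in this way is what distinguishes the advanced evaluation tier from basic judgments that only label a critique as correct or incorrect overall.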

📝 Abstract
The critique capacity of Large Language Models (LLMs) is essential to their reasoning abilities, as it can provide necessary suggestions (e.g., detailed analysis and constructive feedback). Therefore, how to evaluate the critique capacity of LLMs has drawn great attention, and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1) they focus on diverse reasoning tasks in general domains with insufficient evaluation of code tasks (e.g., covering only the code generation task), and the difficulty of their queries is relatively low (e.g., the code queries of CriticBench come from HumanEval and MBPP); (2) they lack comprehensive evaluation along different dimensions. To address these limitations, we introduce a holistic code critique benchmark for LLMs called CodeCriticBench. Specifically, CodeCriticBench includes two mainstream code tasks (i.e., code generation and code QA) at different difficulty levels. Moreover, the evaluation protocols include basic critique evaluation and advanced critique evaluation targeting different characteristics, where fine-grained evaluation checklists are carefully designed for the advanced setting. Finally, we conduct extensive experiments on existing LLMs, which demonstrate the effectiveness of CodeCriticBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating the critique capacity of LLMs
Addressing task narrowness and low query difficulty in existing code critique benchmarks
Introducing CodeCriticBench for comprehensive, multi-dimensional assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Holistic code critique benchmark (CodeCriticBench)
Covers code generation and code QA at multiple difficulty levels
Fine-grained, checklist-based advanced evaluation protocol