CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark

📅 2025-07-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code benchmarks are largely confined to single tasks (e.g., code generation or repair), failing to reflect the diversity and complexity of real-world software engineering scenarios; moreover, their test cases suffer from poor controllability and low reliability. Method: We propose CoreCodeBench, the first configurable, repository-level, multi-scenario code evaluation benchmark, built with an automated pipeline (CorePipe) that generates atomic and composite tasks spanning development, bug fixing, and test-driven development—while enabling hyperparameter-driven difficulty control. Contribution/Results: CoreCodeBench significantly improves problem localization accuracy and test-case reliability, bridging dual gaps in engineering-process coverage and evaluation flexibility. Extensive experiments across 16 state-of-the-art large language models validate its effectiveness and generalizability, establishing a novel paradigm for assessing LLMs’ practical applicability in real-world software engineering.

📝 Abstract
As Large Language Models (LLMs) demonstrate increasingly sophisticated code processing capabilities, evaluating their performance on engineering-level code remains challenging. Existing repository-level benchmarks primarily focus on single scenarios, such as code generation or bug fixing, without adequately capturing the diversity and complexity of real-world software or project engineering workflows. Furthermore, these benchmarks suffer from limited controllability in question positioning and reliability issues in their generated test cases. To address these limitations, we present CorePipe, a fully automated pipeline that converts repositories into comprehensive test cases, and introduce CoreCodeBench, a configurable multi-scenario repository-level benchmark. To simulate real engineering scenarios, CorePipe generates three types of atomic questions (Development, BugFix, and Test-Driven Development) specifically targeting core code segments. These atomic questions are further combined into three types of composite questions, with difficulty levels flexibly adjusted through hyperparameter tuning. CoreCodeBench provides a comprehensive and extensive repository-level benchmark to investigate the applicability of LLMs in real-world engineering projects. Experiments with 16 LLMs across diverse scenarios reveal varying capabilities and offer multi-dimensional insights into LLM performance in engineering contexts. The code for CorePipe is available at https://github.com/AGI-Eval-Official/CoreCodeBench, and the data for CoreCodeBench can be accessed at https://huggingface.co/collections/tubehhh/corecodebench-68256d2faabf4b1610a08caa.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on diverse real-world engineering code scenarios
Addressing limited controllability in repository-level benchmark questions
Improving reliability of test cases for code processing evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully automated pipeline (CorePipe) that converts repositories into comprehensive test cases
Configurable, multi-scenario, repository-level benchmark (CoreCodeBench)
Atomic questions (Development, BugFix, Test-Driven Development) composed into composite questions with hyperparameter-tunable difficulty
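To make the atomic-to-composite idea concrete, here is a minimal, hypothetical sketch of how such questions might be represented and combined. All class, field, and function names below are illustrative assumptions for exposition; they are not CorePipe's actual schema or API.

```python
from dataclasses import dataclass
from typing import List

# The three atomic scenarios described in the paper.
ATOMIC_TYPES = {"Development", "BugFix", "TDD"}


@dataclass
class AtomicQuestion:
    # Hypothetical fields: which scenario, which repository, and which
    # core code segment the question targets.
    qtype: str
    repo: str
    target_function: str
    # A difficulty hyperparameter: e.g., how many lines of the core
    # segment are masked out for the model to reconstruct.
    masked_lines: int = 5


@dataclass
class CompositeQuestion:
    parts: List[AtomicQuestion]

    @property
    def difficulty(self) -> int:
        # A simple illustrative aggregate: composite difficulty grows
        # with the number of atoms and the masking in each.
        return sum(a.masked_lines for a in self.parts)


def compose(atoms: List[AtomicQuestion]) -> CompositeQuestion:
    """Combine validated atomic questions into one composite question."""
    for a in atoms:
        if a.qtype not in ATOMIC_TYPES:
            raise ValueError(f"unknown atomic question type: {a.qtype}")
    return CompositeQuestion(parts=atoms)
```

Under this sketch, tuning `masked_lines` (or the number of atoms per composite) is what "difficulty levels flexibly adjusted through hyperparameter tuning" would amount to in practice.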