Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes

πŸ“… 2025-08-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing code generation benchmarks suffer from data contamination, inadequate test coverage, and lack of dynamic update mechanisms. To address these issues, this paper introduces CODE2BENCHβ€”a novel end-to-end dynamic evaluation framework. Methodologically, it features: (1) automated task construction via real-time GitHub repository harvesting and scope-graph-based dependency analysis; (2) function-level task categorization and property-based testing (PBT) generation ensuring 100% branch coverage; and (3) a multi-language, contamination-resistant mechanism for continuous benchmark evolution. Built upon this framework, the CODE2BENCH-2505 benchmark comprises 1,163 real-world programming tasks. It is the first to systematically expose critical deficiencies of state-of-the-art LLMs in complex logical reasoning and cross-language transfer capabilities. By establishing rigorous, reproducible, and evolving evaluation protocols, CODE2BENCH sets a new standard for empirical assessment of code generation models.

πŸ“ Abstract
As large language models (LLMs) become increasingly integrated into software development workflows, rigorously evaluating their performance on complex, real-world code generation tasks has become essential. However, existing benchmarks often suffer from data contamination and limited test rigor, constraining their ability to reveal model failures effectively. To address these issues, we present CODE2BENCH, an end-to-end pipeline for dynamically constructing robust and contamination-resistant benchmarks from real-world GitHub repositories. Specifically, CODE2BENCH introduces three key innovations: (1) Automated Dynamism, achieved through periodic ingestion of recent code to minimize training data contamination; (2) Scope Graph-based dependency analysis, which enables structured classification of functions into benchmark instances with controlled dependency levels (distinguishing between Self-Contained (SC) tasks for cross-language evaluation and Weakly Self-Contained (WSC) tasks involving permitted library usage); and (3) Property-Based Testing (PBT) for the automated synthesis of rigorous test suites to enable thorough functional verification. Using this pipeline, we construct CODE2BENCH-2505, the first benchmark derived from 880 recent Python projects spanning diverse domains, comprising 1,163 code generation tasks with 100% average branch coverage on ground-truth implementations. Extensive evaluation of 16 LLMs using CODE2BENCH-2505 reveals that models consistently struggle with SC tasks requiring complex, non-standard logic and cross-language transfer, while showing relatively stronger performance on WSC tasks in Python. Our work introduces a contamination-resistant, language-agnostic methodology for dynamic benchmark construction, offering a principled foundation for the comprehensive and realistic evaluation of LLMs on real-world software development tasks.
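The SC/WSC distinction in the abstract can be illustrated with a toy classifier. This is a hedged sketch, not the paper's implementation: CODE2BENCH uses scope-graph dependency analysis, whereas this approximation only inspects import statements with Python's `ast` module; the allow-list is invented for illustration.

```python
# Hedged sketch: classify a function as Self-Contained (SC) if it imports
# nothing, Weakly Self-Contained (WSC) if it only imports from an
# illustrative allow-list of standard libraries, and "other" otherwise.
# This approximates, but does not reproduce, scope-graph analysis.
import ast

ALLOWED_LIBS = {"math", "re", "json", "itertools"}  # hypothetical allow-list

def classify(source: str) -> str:
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])
    if not imported:
        return "SC"       # no external dependencies at all
    if imported <= ALLOWED_LIBS:
        return "WSC"      # only permitted standard libraries
    return "other"        # depends on project-specific or third-party code

print(classify("def f(x):\n    return x * 2"))                      # SC
print(classify("import math\ndef g(x):\n    return math.sqrt(x)"))  # WSC
```

SC tasks translate directly to other languages (no library mapping needed), which is why the paper uses them for cross-language evaluation.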
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on real-world code generation tasks effectively
Addressing data contamination in existing benchmarks for LLMs
Automating rigorous test suite synthesis for functional verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Dynamism for contamination-resistant benchmarks
Scope Graph-based dependency analysis for structured classification
Property-Based Testing for rigorous functional verification
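The third innovation, property-based testing, checks invariant properties over many generated inputs rather than fixed input/output pairs. The sketch below is illustrative only: the function under test and its properties are invented for this example and are not drawn from CODE2BENCH, which synthesizes such suites automatically.

```python
# Hedged sketch of property-based testing (PBT): generate many random
# inputs and assert properties that any correct implementation must
# satisfy, instead of comparing against a handful of fixed outputs.
import random
import string

def normalize_whitespace(s: str) -> str:
    """Ground-truth implementation: collapse whitespace runs, strip ends."""
    return " ".join(s.split())

def random_string(rng, max_len=40):
    # Mix letters with spaces, tabs, and newlines to stress the function.
    alphabet = string.ascii_letters + "  \t\n"
    return "".join(rng.choice(alphabet) for _ in range(rng.randrange(max_len)))

def check_properties(fn, trials=500, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        out = fn(random_string(rng))
        # Property 1: idempotence -- normalizing twice changes nothing.
        assert fn(out) == out
        # Property 2: no leading or trailing whitespace survives.
        assert out == out.strip()
        # Property 3: no run of two spaces remains.
        assert "  " not in out
    return True

print(check_properties(normalize_whitespace))  # True when all properties hold
```

Because each property must hold on every generated input and exercises all code paths, this style of testing is what lets the benchmark claim high branch coverage on ground-truth implementations.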
πŸ‘₯ Authors
Zhe Zhang (Beihang University)
Runlin Liu (Beihang University)
Aishan Liu (Beihang University)
Xingyu Liu (Beihang University)
Xiang Gao (Beihang University)
Hailong Sun (Professor of Computer Science, Beihang University)
Software Engineering · Artificial Intelligence · Software Systems