Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models

📅 2025-03-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Static benchmarks for code large language models (LLMs) suffer from training data contamination, leading to inflated and unreliable performance estimates. Method: This paper introduces the first dynamic contamination-resistant benchmarking framework. It automatically generates syntactically novel yet semantically equivalent test cases via semantics-preserving program mutations—such as variable renaming and control-flow–equivalent transformations—followed by formal consistency verification to guarantee functional equivalence. Contribution/Results: Evaluated on ten state-of-the-art code LLMs, the framework induces substantial performance degradation (average drop >40%), revealing severe overestimation of model capabilities by static benchmarks and triggering significant rank reversals among models. This work pioneers dynamic benchmark reconstruction, establishing a new paradigm for trustworthy, contamination-aware evaluation of code LLMs.

📝 Abstract
In this paper, we tackle a critical challenge in model evaluation: how to keep code benchmarks useful when models may have already seen them during training. We introduce a novel solution, a dynamic benchmarking framework, to address this challenge. Given a code understanding or reasoning benchmark, our framework dynamically transforms each input, i.e., each program, with various semantics-preserving mutations to build a syntactically new yet semantically identical benchmark. We evaluated ten popular language models on our dynamic benchmarks. Our evaluation reveals several interesting or surprising findings: (1) all models perform significantly worse than before, (2) the rankings of some models shift dramatically, and (3) our dynamic benchmarks resist the data contamination problem.
Problem

Research questions and friction points this paper is trying to address.

Dynamic benchmarking for evaluating code language models
Addressing training-data contamination in model benchmarks
Semantics-preserving mutations to create new, useful benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic benchmarking framework for code models
Semantics-preserving mutations on program inputs
Resists data contamination in model evaluation
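To make the core idea concrete, below is a minimal, illustrative sketch of one semantics-preserving mutation mentioned in the summary: variable renaming. This is not the authors' implementation; it is a hypothetical example using Python's `ast` module to show how a program can be made syntactically new while remaining functionally identical.

```python
import ast

class RenameVars(ast.NodeTransformer):
    """Rename variables according to a mapping; the program's behavior is unchanged."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Rename variable reads and writes.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):
        # Rename function parameters.
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node

src = "def add(x, y):\n    total = x + y\n    return total"
tree = ast.parse(src)
mutated = RenameVars({"x": "v0", "y": "v1", "total": "v2"}).visit(tree)
print(ast.unparse(mutated))
# def add(v0, v1):
#     v2 = v0 + v1
#     return v2
```

A real framework would apply many such mutations (e.g., control-flow-equivalent transformations) and, per the summary, verify functional equivalence of the mutated benchmark; a contaminated model that memorized the original surface form gains no advantage on the renamed variant.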
Batu Guan
The Chinese University of Hong Kong
Xiao Wu
Huazhong University of Science and Technology
Yuanyuan Yuan
ETH Zurich
Shaohua Li
The Chinese University of Hong Kong