🤖 AI Summary
Static benchmarks for code large language models (LLMs) suffer from training data contamination, leading to inflated and unreliable performance estimates. Method: This paper introduces the first dynamic contamination-resistant benchmarking framework. It automatically generates syntactically novel yet semantically equivalent test cases via semantics-preserving program mutations—such as variable renaming and control-flow–equivalent transformations—followed by formal consistency verification to guarantee functional equivalence. Contribution/Results: Applied to ten state-of-the-art code LLMs, the framework induces substantial performance degradation (average drop >40%), revealing that static benchmarks severely overestimate model capabilities and triggering significant rank reversals among models. This work pioneers dynamic benchmark reconstruction, establishing a new paradigm for trustworthy, contamination-aware evaluation of code LLMs.
📝 Abstract
In this paper, we tackle a critical challenge in model evaluation: how to keep code benchmarks useful when models might have already seen them during training. We introduce a novel solution, a dynamic benchmarking framework, to address this challenge. Given a code understanding or reasoning benchmark, our framework dynamically transforms each input program with various semantics-preserving mutations to build a syntactically new yet semantically identical benchmark. We evaluated ten popular language models on our dynamic benchmarks. Our evaluation reveals several interesting and surprising findings: (1) all models perform significantly worse than before, (2) the ranking between some models shifts dramatically, and (3) our dynamic benchmarks resist the data contamination problem.
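To make the idea concrete, here is a minimal sketch (not the paper's implementation) of one semantics-preserving mutation mentioned above, variable renaming, built on Python's `ast` module, together with a cheap output-comparison check standing in for the consistency verification step; the helper names `mutate` and `run` are assumptions for illustration:

```python
import ast
import builtins
import io
from contextlib import redirect_stdout

class RenameVariables(ast.NodeTransformer):
    """Illustrative semantics-preserving mutation: rename every
    user-defined variable to a fresh identifier, leaving builtins
    such as `print` untouched so behavior is unchanged."""

    def __init__(self):
        self.mapping = {}  # original name -> fresh name

    def visit_Name(self, node):
        if node.id in dir(builtins):
            return node  # never rename builtins
        self.mapping.setdefault(node.id, f"var_{len(self.mapping)}")
        node.id = self.mapping[node.id]
        return node

def mutate(source: str) -> str:
    """Return a syntactically new but semantically identical program."""
    tree = RenameVariables().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))

def run(source: str) -> str:
    """Execute a snippet and capture its stdout; comparing the
    outputs of original and mutant is a simple equivalence check."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        exec(source, {})
    return buf.getvalue()

original = "x = 3\ny = x * 2 + 1\nprint(y)"
mutant = mutate(original)   # e.g. "var_0 = 3\nvar_1 = var_0 * 2 + 1\nprint(var_1)"
```

A real framework would combine many such mutations (e.g. control-flow-equivalent rewrites) and verify equivalence more rigorously than by comparing one run's output, but the sketch captures the core transform-then-verify loop.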