🤖 AI Summary
Static benchmarks for code large language models (LLMs) suffer from training data contamination, leading to inflated and unreliable performance estimates. Method: This paper introduces the first dynamic contamination-resistant benchmarking framework. It automatically generates syntactically novel yet semantically equivalent test cases via semantics-preserving program mutations—such as variable renaming and control-flow–equivalent transformations—followed by formal consistency verification to guarantee functional equivalence. Contribution/Results: Applied to ten state-of-the-art code LLMs, the framework induces substantial performance degradation (average drop >40%), revealing that static benchmarks severely overestimate model capabilities and triggering significant rank reversals among models. This work pioneers dynamic benchmark reconstruction, establishing a new paradigm for trustworthy, contamination-aware evaluation of code LLMs.
📝 Abstract
In this paper, we tackle a critical challenge in model evaluation: how to keep code benchmarks useful when models might have already seen them during training. We introduce a novel solution, a dynamic benchmarking framework, to address this challenge. Given a code understanding or reasoning benchmark, our framework dynamically transforms each input program with various semantics-preserving mutations to build a syntactically new yet semantically identical benchmark. We evaluated ten popular language models on our dynamic benchmarks. Our evaluation reveals several interesting and surprising findings: (1) all models perform significantly worse than before, (2) the ranking between some models shifts dramatically, and (3) our dynamic benchmarks resist the data contamination problem.
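To make the idea concrete, here is a minimal sketch (not the paper's implementation) of one semantics-preserving mutation mentioned above, variable renaming, built on Python's `ast` module, together with a cheap output-comparison check standing in for the consistency verification step; the helper names `mutate` and `run` are assumptions for illustration:

```python
import ast
import builtins
import io
from contextlib import redirect_stdout

class RenameVariables(ast.NodeTransformer):
    """Illustrative semantics-preserving mutation: rename every
    user-defined variable to a fresh identifier, leaving builtins
    such as `print` untouched so behavior is unchanged."""

    def __init__(self):
        self.mapping = {}  # original name -> fresh name

    def visit_Name(self, node):
        if node.id in dir(builtins):
            return node  # never rename builtins
        self.mapping.setdefault(node.id, f"var_{len(self.mapping)}")
        node.id = self.mapping[node.id]
        return node

def mutate(source: str) -> str:
    """Return a syntactically new but semantically identical program."""
    tree = RenameVariables().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))

def run(source: str) -> str:
    """Execute a snippet and capture its stdout; comparing the
    outputs of original and mutant is a simple equivalence check."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        exec(source, {})
    return buf.getvalue()

original = "x = 3\ny = x * 2 + 1\nprint(y)"
mutant = mutate(original)   # e.g. "var_0 = 3\nvar_1 = var_0 * 2 + 1\nprint(var_1)"
```

A real framework would combine many such mutations (e.g. control-flow-equivalent rewrites) and verify equivalence more rigorously than by comparing one run's output, but the sketch captures the core transform-then-verify loop.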