🤖 AI Summary
Static benchmarks suffer from data contamination, making it difficult to reliably distinguish model reasoning from memorization.
Method: This paper introduces a dynamic evaluation framework grounded in arXiv paper timestamps. It automatically synthesizes 1,643 multi-step reasoning questions to construct a longitudinal benchmark with a well-defined knowledge cutoff date. The synthesis follows a reasoning-driven paradigm to increase question complexity and mitigate shallow memorization biases. Evaluation employs temporal stratification and cross-vendor, multi-scale, multi-cutoff-date model comparisons.
Contribution/Results: Empirical results show no significant performance degradation in mainstream large language models after the knowledge cutoff, indicating strong robustness against data contamination. This validates the proposed synthetic methodology as a scalable, reasoning-oriented alternative to static benchmarks, providing both empirical evidence and methodological foundations for transitioning toward dynamic, extensible, and reasoning-focused evaluation paradigms.
📝 Abstract
Capability evaluation of large language models (LLMs) is increasingly shadowed by rising concerns of data contamination that cast doubt on whether static benchmarks measure genuine reasoning or mere memorization. We present an empirical study using an infinitely scalable framework to synthesize research-level QA directly from arXiv papers, harnessing the natural temporal structure of research publications, where performance decay after knowledge cutoffs may indicate potential contamination. We evaluated 4 frontier model families, represented by 2 models with different knowledge cutoff dates per family, on 1,643 multi-step reasoning questions synthesized from 20,277 arXiv papers stratified over 26 months, covering at least 6 months before and after all cutoff dates. Our results consistently showed a lack of significant performance decay near knowledge cutoff dates for models of various sizes, developers, and release dates. We further performed a comparative analysis with previous longitudinal studies that reported significant post-cutoff performance decay using directly retrieved questions based on public data. We hypothesize that the multi-step reasoning required by our synthesis pipeline adds complexity that goes deeper than shallow memorization, which effectively serves as a mitigation strategy against benchmark contamination. We fully open-source our code and dataset to aid reproducibility and advocate for a paradigm shift that prioritizes reasoning-driven synthesis for benchmark construction over simply collecting newly released questions periodically.
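The core contamination check described above can be sketched as follows: stratify questions by the submission month of their source paper, then compare accuracy before versus after a model's knowledge cutoff. This is a minimal illustrative sketch with synthetic data, not the paper's actual pipeline; the function name and the toy records are assumptions, and the real study additionally applies statistical significance testing across models and cutoff dates.

```python
from datetime import date
from statistics import mean

def post_cutoff_decay(results, cutoff):
    """Difference in mean accuracy between pre- and post-cutoff questions.

    results: list of (paper_month, correct) pairs, where paper_month is the
             arXiv submission date of the source paper and correct is 0 or 1.
    cutoff:  the model's knowledge cutoff date.

    A large positive value would suggest the model does better on material
    it may have seen in training, i.e. contamination-driven memorization;
    a value near zero is consistent with the paper's finding of no decay.
    """
    pre = [ok for month, ok in results if month < cutoff]
    post = [ok for month, ok in results if month >= cutoff]
    return mean(pre) - mean(post)

# Toy illustration (synthetic scores, not the paper's numbers):
results = [
    (date(2023, 11, 1), 1), (date(2023, 12, 1), 1),
    (date(2024, 2, 1), 1), (date(2024, 3, 1), 0),
]
gap = post_cutoff_decay(results, cutoff=date(2024, 1, 1))
```

In the study's setting this comparison is repeated per model family and per cutoff date over the 26-month window, which is what makes the benchmark longitudinal rather than a single snapshot.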