🤖 AI Summary
As large language models rapidly advance, AI benchmarks are saturating, making it increasingly difficult to differentiate top-performing models. This study systematically examines saturation across 60 benchmarks, characterizing each along 14 properties spanning task design, data construction, and evaluation format. Through quantitative modeling and cross-temporal trend analysis, the work offers large-scale empirical evidence of how prevalent benchmark saturation is and how it evolves over time. The findings show that expert-curated benchmarks resist saturation better than crowdsourced ones, while public availability of test sets has no significant effect on delaying saturation. Nearly half of the examined benchmarks have already saturated, and this proportion rises as benchmarks age. These results yield empirical insights and design principles for building durable and robust AI evaluation frameworks.
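The summary does not spell out how saturation is operationalized. One plausible criterion, sketched below purely for illustration, flags a benchmark as saturated when the strongest models cluster within a small margin of one another near the score ceiling; the thresholds, the five-model cutoff, and the `is_saturated` helper are assumptions, not the paper's actual method.

```python
import numpy as np

def is_saturated(top_scores, ceiling=100.0, margin=2.0):
    """Hypothetical saturation check (illustrative thresholds only):
    a benchmark is flagged as saturated when its best models score
    close to the ceiling and within a narrow band of each other,
    so the benchmark can no longer separate them.
    """
    scores = np.sort(np.asarray(top_scores, dtype=float))[::-1]
    leaders = scores[:5]  # scores of the five strongest models
    spread = leaders.max() - leaders.min()
    near_ceiling = leaders.max() >= ceiling - 5.0
    return near_ceiling and spread <= margin

# Example: five frontier models scoring 96-98 on a 0-100 benchmark
print(is_saturated([98.1, 97.6, 97.2, 96.8, 96.4]))  # True
```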
📝 Abstract
Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how each property contributes to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, hiding test data (i.e., public vs. private) shows no protective effect, while expert-curated benchmarks resist saturation better than crowdsourced ones. Our findings highlight which design choices extend benchmark longevity and inform strategies for more durable evaluation.
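To make the hypothesis-testing step concrete, here is a minimal sketch of one way a property-versus-saturation comparison could be run, assuming each benchmark carries a binary property label (e.g., expert-curated vs. crowdsourced) and a saturated/not-saturated outcome. The contingency counts are invented for illustration, and the paper's actual statistical methodology may differ.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 contingency table over 60 benchmarks
# (counts are invented): columns = [saturated, not saturated]
table = [
    [8, 22],   # expert-curated benchmarks
    [19, 11],  # crowdsourced benchmarks
]

# Fisher's exact test asks whether saturation status is
# independent of the curation method.
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Saturation rate differs significantly by curation method.")
else:
    print("No significant difference detected.")
```

With small per-group counts like these, an exact test is a safer choice than a chi-squared approximation; a logistic regression over all 14 properties would be the natural extension when testing several properties jointly.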