🤖 AI Summary
Rapid advances in frontier AI models quickly render existing benchmarks obsolete, leading to inflated capability estimates, obscured safety risks, and distorted evaluations. To address this, we introduce the first systematic framework for benchmark deprecation, defining formal criteria and a tiered implementation protocol that distinguishes full from partial deprecation. Our decision model rests on three empirically grounded dimensions: task specificity, model saturation, and metric misleadingness, enabling actionable, evidence-based deprecation judgments. Through a comprehensive review of prevailing benchmarking practices across major AI evaluation initiatives, we deliver a transparent, robust lifecycle-management guide for benchmark developers, practitioners, governance bodies, and policymakers. Our framework strengthens evaluation rigor and safety assurance, fostering a more standardized, adaptive, and responsible AI evaluation ecosystem.
📝 Abstract
As frontier artificial intelligence (AI) models rapidly advance, benchmarks are integral to comparing models and measuring their progress across task-specific domains. However, there is little guidance on when and how benchmarks should be deprecated once they cease to serve their purpose effectively. This risks benchmark scores overstating model capabilities or, worse, obscuring capabilities and enabling safety-washing. Based on a review of benchmarking practices, we propose criteria for deciding when to fully or partially deprecate a benchmark, and a framework for carrying out that deprecation. Our work aims to advance the state of benchmarking towards rigorous, high-quality evaluations, especially for frontier models, and our recommendations are intended to benefit benchmark developers, benchmark users, AI governance actors (across governments, academia, and industry panels), and policy makers.
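To make the deprecation decision described above concrete, here is a minimal, hypothetical sketch in Python of how a review along the three dimensions named in the summary (task specificity, model saturation, metric misleadingness) might map to a full or partial deprecation recommendation. The 0-1 scores, thresholds, and the `recommend_status` rule are illustrative assumptions only, not the paper's actual criteria or protocol.

```python
from dataclasses import dataclass
from enum import Enum


class DeprecationStatus(Enum):
    """Possible outcomes of a benchmark deprecation review."""
    RETAIN = "retain"
    PARTIAL = "partial deprecation"  # e.g. retire only saturated or misleading subtasks
    FULL = "full deprecation"


@dataclass
class BenchmarkReview:
    """Evidence gathered for one benchmark; all scores are hypothetical 0-1 ratings."""
    task_specificity: float       # how well the tasks still target the intended capability
    saturation: float             # fraction of frontier models at or near ceiling performance
    metric_misleadingness: float  # degree to which the metric no longer tracks real capability


def recommend_status(review: BenchmarkReview,
                     saturation_threshold: float = 0.9,
                     misleadingness_threshold: float = 0.7) -> DeprecationStatus:
    """Toy decision rule: fully deprecate when the metric is badly misleading,
    or when the benchmark is saturated and no longer task-specific; partially
    deprecate when saturation is high but the tasks remain well targeted."""
    if review.metric_misleadingness >= misleadingness_threshold:
        return DeprecationStatus.FULL
    if review.saturation >= saturation_threshold:
        if review.task_specificity > 0.5:
            return DeprecationStatus.PARTIAL
        return DeprecationStatus.FULL
    return DeprecationStatus.RETAIN


if __name__ == "__main__":
    review = BenchmarkReview(task_specificity=0.8, saturation=0.95, metric_misleadingness=0.3)
    print(recommend_status(review))  # DeprecationStatus.PARTIAL
```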