🤖 AI Summary
Rapid advances in frontier AI models quickly render existing benchmarks obsolete, leading to inflated capability estimates, obscured safety risks, and distorted evaluations. To address this, we introduce the first systematic framework for benchmark deprecation, defining formal criteria and a tiered implementation protocol that distinguishes full from partial deprecation. Our decision model rests on three empirically grounded dimensions: task specificity, model saturation, and metric misleadingness, enabling actionable, evidence-based deprecation judgments. Through a comprehensive review of prevailing benchmarking practices across major AI evaluation initiatives, we deliver a transparent, robust lifecycle-management guide for benchmark developers, practitioners, governance bodies, and policymakers. Our framework strengthens evaluation rigor and safety assurance, fostering a more standardized, adaptive, and responsible AI evaluation ecosystem.
📝 Abstract
As frontier artificial intelligence (AI) models rapidly advance, benchmarks are integral to comparing models and measuring their progress across task-specific domains. However, there is little guidance on when and how benchmarks should be deprecated once they cease to serve their purpose effectively. This risks benchmark scores overstating model capabilities or, worse, obscuring capabilities and enabling safety-washing. Based on a review of benchmarking practices, we propose criteria for deciding when to fully or partially deprecate a benchmark, and a framework for carrying out that deprecation. Our work aims to advance the state of benchmarking towards rigorous, high-quality evaluations, especially for frontier models, and our recommendations are intended to benefit benchmark developers, benchmark users, AI governance actors (across governments, academia, and industry panels), and policy makers.
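To make the deprecation decision described above concrete, here is a minimal, hypothetical sketch in Python of how a review along the three dimensions named in the summary (task specificity, model saturation, metric misleadingness) might map to a full or partial deprecation recommendation. The 0-1 scores, thresholds, and the `recommend_status` rule are illustrative assumptions only, not the paper's actual criteria or protocol.

```python
from dataclasses import dataclass
from enum import Enum


class DeprecationStatus(Enum):
    """Possible outcomes of a benchmark deprecation review."""
    RETAIN = "retain"
    PARTIAL = "partial deprecation"  # e.g. retire only saturated or misleading subtasks
    FULL = "full deprecation"


@dataclass
class BenchmarkReview:
    """Evidence gathered for one benchmark; all scores are hypothetical 0-1 ratings."""
    task_specificity: float       # how well the tasks still target the intended capability
    saturation: float             # fraction of frontier models at or near ceiling performance
    metric_misleadingness: float  # degree to which the metric no longer tracks real capability


def recommend_status(review: BenchmarkReview,
                     saturation_threshold: float = 0.9,
                     misleadingness_threshold: float = 0.7) -> DeprecationStatus:
    """Toy decision rule: fully deprecate when the metric is badly misleading,
    or when the benchmark is saturated and no longer task-specific; partially
    deprecate when saturation is high but the tasks remain well targeted."""
    if review.metric_misleadingness >= misleadingness_threshold:
        return DeprecationStatus.FULL
    if review.saturation >= saturation_threshold:
        if review.task_specificity > 0.5:
            return DeprecationStatus.PARTIAL
        return DeprecationStatus.FULL
    return DeprecationStatus.RETAIN


if __name__ == "__main__":
    review = BenchmarkReview(task_specificity=0.8, saturation=0.95, metric_misleadingness=0.3)
    print(recommend_status(review))  # DeprecationStatus.PARTIAL
```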