🤖 AI Summary
Current AI evaluation relies too heavily on aggregate scores, which distorts assessments and creates deployment risks: item selection is ambiguous, constructs are misaligned, and results generalize poorly. This work proposes a paradigm shift that treats item-level response data as the foundational infrastructure for evaluation. We present OpenEval, a large-scale open repository unifying 155,000 items and 10 million model responses. The repository enables fine-grained diagnostics of item quality, validation of construct alignment, and structural analysis of benchmarks, while mitigating data contamination and reducing the burden on benchmark authors. Through illustrative analyses, we show that this evidence-driven, granular approach makes evaluation more transparent, reproducible, and auditable, which helps restore validity to AI assessment practices.
📝 Abstract
AI evaluations have become the primary evidence for deploying generative AI systems across high-stakes domains. However, current evaluation paradigms often exhibit systemic validity failures. These issues, ranging from unjustified design choices to misaligned metrics, remain intractable without a principled framework for gathering validity evidence and conducting granular diagnostic analysis. In this position paper, we argue that item-level AI benchmark data is essential for establishing a rigorous science of AI evaluation. Item-level analysis enables fine-grained diagnostics and principled validation of benchmarks. We substantiate this position by dissecting current validity failures and revisiting evaluation paradigms across computer science and psychometrics. Through illustrative analyses of item properties and latent constructs, we demonstrate the unique insights afforded by item-level data. To catalyze community-wide adoption, we introduce OpenEval, a growing repository of item-level benchmark data designed to support evidence-centered AI evaluation.