Recommendations and Reporting Checklist for Rigorous & Transparent Human Baselines in Model Evaluations

📅 2025-06-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current human baselines in large language model evaluations lack methodological rigor and transparency, undermining the validity of claims such as "superhuman performance." Method: The paper brings classical measurement theory to bear on AI evaluation, deriving a framework that spans human baseline design, execution, and reporting, and distilling it into an actionable reporting checklist via a meta-review of the measurement theory and AI evaluation literatures. Contribution/Results: Applying the checklist to systematically review 115 human baselines in foundation model evaluations, the authors identify pervasive methodological shortcomings. The open-source checklist and audit data support more reproducible, comparable, and accountable AI evaluation, grounding benchmarking practice in psychometric principles and giving claims about model capabilities a firmer methodological foundation.

📝 Abstract
In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: https://github.com/kevinlwei/human-baselines
Problem

Research questions and friction points this paper is trying to address.

Improve rigor in human baselines for AI evaluations
Enhance transparency in reporting human vs AI performance
Address shortcomings in existing baselining methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework for rigorous human baseline design
Checklist for transparent baseline reporting (see the illustrative sketch after this list)
Meta-review of measurement theory literature
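
To make the checklist-driven audit concrete, the sketch below shows one simple way such a checklist could be encoded and applied to a reviewed study. It is purely illustrative: the item names and the `audit` helper are hypothetical, not the paper's actual checklist wording or its released audit data at the GitHub link above.

```python
# Illustrative sketch only: encoding a reporting checklist and applying it
# when auditing a human-baseline study. Item names and the `audit` helper
# are hypothetical, not the paper's actual checklist or tool.
from dataclasses import dataclass, field


@dataclass
class ChecklistItem:
    key: str       # short identifier for the reporting item
    question: str  # what the auditor checks in the study's write-up


@dataclass
class AuditResult:
    reported: dict = field(default_factory=dict)

    @property
    def coverage(self) -> float:
        """Fraction of checklist items the study reports."""
        return sum(self.reported.values()) / len(self.reported)


# Hypothetical items, loosely grouped by the paper's design/execution/reporting stages.
CHECKLIST = [
    ChecklistItem("sampling", "Is the human participant sampling strategy described?"),
    ChecklistItem("instructions", "Are the instructions given to human participants reported?"),
    ChecklistItem("compensation", "Is participant compensation disclosed?"),
    ChecklistItem("uncertainty", "Are uncertainty estimates (e.g., CIs) reported for human scores?"),
    ChecklistItem("comparability", "Were humans and models evaluated under comparable conditions?"),
]


def audit(study_reporting: dict) -> AuditResult:
    """Mark each checklist item as reported (True) or missing (False) for one study."""
    return AuditResult({item.key: study_reporting.get(item.key, False) for item in CHECKLIST})


# Example: a study that reports instructions and uncertainty but omits the rest.
result = audit({"instructions": True, "uncertainty": True})
print(f"Checklist coverage: {result.coverage:.0%}")  # -> Checklist coverage: 40%
```

Aggregating such per-study coverage scores across many evaluations is one way the kind of systematic review the paper describes could surface which reporting items are most often omitted.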
🔎 Similar Papers
No similar papers found.
Kevin L. Wei
RAND; Harvard Law School
AI evaluation, AI safety, AI governance, private law, empirical legal studies
Patricia Paskov
RAND
AI evaluation, AI governance, economics, international development
Sunishchal Dev
RAND, Santa Monica, CA, USA; Algoverse
Michael J. Byun
RAND, Santa Monica, CA, USA; Independent
Anka Reuel
CS Ph.D. Candidate, Stanford University
AI Governance, Responsible AI, AI Ethics, AI Safety
Xavier Roberts-Gaal
Harvard University, Cambridge, MA, USA
Rachel Calcott
Harvard University, Cambridge, MA, USA
Evie Coxon
Max Planck School of Cognition, Leipzig, Germany
Chinmay Deshpande
Center for Democracy & Technology, Washington, D.C., USA