Measuring What Matters: Construct Validity in Large Language Model Benchmarks

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM evaluation benchmarks suffer from weak construct validity, particularly for abstract constructs such as "safety" and "robustness", owing to widespread construct–task misalignment in phenomenon definition, task design, and scoring metrics. Method: A team of expert reviewers systematically examined 445 benchmarks from top-tier conferences (ACL, EMNLP, NeurIPS), identifying eight recurrent validity threat patterns. Contribution/Results: The paper proposes eight actionable benchmark design principles, grounded in construct validity, together with accompanying validation guidance. These provide a theoretical framework and empirical foundation for improving the scientific rigor and reliability of LLM evaluation, filling a methodological gap in LLM assessment and shifting benchmark development from empirically driven practice toward validity-driven science.

Technology Category

Application Category

📝 Abstract
Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as 'safety' and 'robustness' requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed actionable guidance to researchers and practitioners in developing LLM benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Assessing construct validity issues in LLM benchmark evaluations
Identifying flawed measurement patterns for safety and robustness
Providing actionable guidance for developing valid LLM benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic review of 445 LLM benchmarks
Expert analysis of construct validity patterns
Eight actionable recommendations for benchmark development
Authors

Andrew M. Bean, University of Oxford
R. Kearns, University of Oxford
Angelika Romanou, EPFL (Natural Language Processing, Machine Learning, AI)
Franziska Sofia Hafner, Oxford Internet Institute (Artificial Intelligence, Policy, Bias)
Harry Mayne, University of Oxford
Jan Batzner, Weizenbaum Institute Berlin, Technical University Munich
Negar Foroutan, EPFL
Chris Schmitz, PhD Student, Centre for Digital Governance, Hertie School
Karolina Korgul, Oxford Internet Institute, University of Oxford (AI Safety, AI Agents, Evals)
Hunar Batra, University of Oxford (Machine Learning, Language Models, Multimodal AI, Reinforcement Learning, AI Safety)
Oishi Deb, University of Oxford
Emma Beharry, Stanford University
Cornelius Emde, University of Oxford
Thomas Foster, University of Oxford
Anna Gausen, UK AI Security Institute
María Grandury, SomosNLP, Universidad Politécnica de Madrid
Simeng Han, Yale University
Valentin Hofmann, Allen Institute for AI & University of Washington (Natural Language Processing, Large Language Models, Computational Linguistics)
Lujain Ibrahim, University of Oxford (Human-AI Interaction, Evaluations, Societal Impact of AI, Sociotechnical AI)
Hazel Kim, University of Oxford
Hannah Rose Kirk, University of Oxford (Large Language Models, NLP, Ethics in AI, Alignment, AI Safety)
Fangru Lin, DPhil student, University of Oxford (Language Modelling, Evaluation, Neuro-symbolic Methods, Computational Linguistics, LLM Agents)
Gabrielle Kaili-May Liu, Yale University
Lennart Luettgau, UK AI Security Institute
Jabez Magomere, University of Oxford (Natural Language Processing)
Jonathan Rystrøm, University of Oxford
Anna Sotnikova, EPFL
Yushi Yang, Stanford University (MEMS, Sensors, Measurement, Fabrication)
Yilun Zhao, Yale University
Adel Bibi, University of Oxford (AI Safety, AI Security, Machine Learning)
A. Bosselut, EPFL
Ronald Clark, University of Oxford (Computer Vision, Robotics, Machine Learning, Optimisation)
Arman Cohan, Yale University; Allen Institute for AI (Natural Language Processing, Machine Learning, Artificial Intelligence)
Jakob Foerster, Associate Professor, University of Oxford (Artificial Intelligence)
Yarin Gal, Professor of Machine Learning, University of Oxford (Machine Learning, Artificial Intelligence, Probability Theory, Statistics)
Scott A. Hale, Oxford Internet Institute, University of Oxford; Meedan; Alan Turing Institute (NLP, Computational Sociolinguistics, Machine Learning Applications, Political Mobilization)
Inioluwa Deborah Raji, UC Berkeley (Machine Learning, Evaluation, Auditing, Algorithms, Society)
Chris Summerfield, University of Oxford; UK AI Security Institute
Philip H. S. Torr, University of Oxford
C. Ududec, UK AI Security Institute
Luc Rocher, Associate Professor, University of Oxford (Privacy, Algorithm Auditing, Algorithmic Fairness, Machine Learning)
Adam Mahdi, Associate Professor, University of Oxford (Large Language Models, Multimodal AI, Digital Health)