🤖 AI Summary
Current LLM evaluation benchmarks often lack construct validity, particularly for abstract constructs such as "safety" and "robustness", because of widespread misalignment between the measured construct and the phenomenon definition, task design, and scoring metrics. Method: Through an expert-guided systematic literature review, we examined 445 benchmarks from top-tier conferences (ACL, EMNLP, NeurIPS) and identified recurrent validity threat patterns in how phenomena are defined, tasks are designed, and responses are scored. Contribution/Results: From this construct-validity perspective, we distill eight actionable benchmark design recommendations together with detailed validation guidance, providing a conceptual framework and empirical grounding for improving the scientific rigor and reliability of LLM evaluation. The work addresses a methodological gap in LLM assessment by establishing a systematic approach to validity checking, shifting benchmark development from ad hoc practice toward validity-driven design. An illustrative sketch of what such a checklist could look like in code follows below.
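Purely as an illustration, here is a minimal sketch of how a construct-validity checklist for a benchmark might be encoded; the paper's actual guideline is not reproduced here, and every class, field, and check name below is an assumption:

```python
# Hypothetical sketch of a benchmark validity checklist.
# All names and checks are illustrative assumptions, not the paper's guideline.
from dataclasses import dataclass, field

@dataclass
class BenchmarkSpec:
    name: str
    phenomenon_definition: str   # what the benchmark claims to measure
    task_description: str        # what the items actually ask the model to do
    scoring_metric: str          # how responses are scored
    # e.g. {"task_alignment": "...", "metric_alignment": "..."}
    justifications: dict = field(default_factory=dict)

def validity_report(spec: BenchmarkSpec) -> list[str]:
    """Flag gaps a construct-validity review would typically ask about."""
    issues = []
    if not spec.phenomenon_definition.strip():
        issues.append("phenomenon is only named, not operationalised")
    for required in ("task_alignment", "metric_alignment"):
        if required not in spec.justifications:
            issues.append(f"no stated justification for {required}")
    return issues

report = validity_report(BenchmarkSpec(
    name="toy-safety-bench",
    phenomenon_definition="",
    task_description="multiple-choice refusal questions",
    scoring_metric="accuracy",
))
print(report)  # lists the missing definition and both missing justifications
```

The point of the sketch is only that each validity claim (phenomenon definition, task alignment, metric alignment) becomes an explicit, checkable artifact rather than an implicit assumption.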
📝 Abstract
Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as 'safety' and 'robustness' requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed actionable guidance to researchers and practitioners in developing LLM benchmarks.