🤖 AI Summary
Current LLM evaluation benchmarks often lack construct validity, particularly for abstract constructs such as "safety" and "robustness", because of widespread misalignment between the measured construct and the phenomenon definition, task design, and scoring metrics. Method: Through an expert-guided systematic literature review, we examined 445 benchmarks from top-tier conferences (ACL, EMNLP, NeurIPS) and identified recurrent validity threat patterns in how phenomena are defined, tasks are designed, and responses are scored. Contribution/Results: From this construct-validity perspective, we distill eight actionable benchmark design recommendations together with detailed validation guidance, providing a conceptual framework and empirical grounding for improving the scientific rigor and reliability of LLM evaluation. The work addresses a methodological gap in LLM assessment by establishing a systematic approach to validity checking, shifting benchmark development from ad hoc practice toward validity-driven design. An illustrative sketch of what such a checklist could look like in code follows below.
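Purely as an illustration, here is a minimal sketch of how a construct-validity checklist for a benchmark might be encoded; the paper's actual guideline is not reproduced here, and every class, field, and check name below is an assumption:

```python
# Hypothetical sketch of a benchmark validity checklist.
# All names and checks are illustrative assumptions, not the paper's guideline.
from dataclasses import dataclass, field

@dataclass
class BenchmarkSpec:
    name: str
    phenomenon_definition: str   # what the benchmark claims to measure
    task_description: str        # what the items actually ask the model to do
    scoring_metric: str          # how responses are scored
    # e.g. {"task_alignment": "...", "metric_alignment": "..."}
    justifications: dict = field(default_factory=dict)

def validity_report(spec: BenchmarkSpec) -> list[str]:
    """Flag gaps a construct-validity review would typically ask about."""
    issues = []
    if not spec.phenomenon_definition.strip():
        issues.append("phenomenon is only named, not operationalised")
    for required in ("task_alignment", "metric_alignment"):
        if required not in spec.justifications:
            issues.append(f"no stated justification for {required}")
    return issues

report = validity_report(BenchmarkSpec(
    name="toy-safety-bench",
    phenomenon_definition="",
    task_description="multiple-choice refusal questions",
    scoring_metric="accuracy",
))
print(report)  # lists the missing definition and both missing justifications
```

The point of the sketch is only that each validity claim (phenomenon definition, task alignment, metric alignment) becomes an explicit, checkable artifact rather than an implicit assumption.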
📝 Abstract
Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as 'safety' and 'robustness' requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed actionable guidance to researchers and practitioners in developing LLM benchmarks.