Position: AI Evaluation Should Learn from How We Test Humans

📅 2023-06-18
📈 Citations: 21
Influential: 2
🤖 AI Summary
Current AI evaluation faces challenges including high annotation costs, training-evaluation data contamination, and low-quality test items—leading to insufficient reliability and validity. This paper pioneers the systematic integration of classical psychometrics—particularly Item Response Theory (IRT)—into AI capability assessment, proposing an IRT-based adaptive testing paradigm. By jointly modeling item parameters and model abilities, it enables dynamic item selection, personalized measurement, and interpretable evaluation. Unlike static benchmark suites, our approach supports real-time ability estimation and online item parameter calibration. Experiments demonstrate that, compared to conventional benchmarks, our paradigm reduces annotation costs by over 40%, mitigates data contamination between training and evaluation, and significantly improves both reliability and validity. This work establishes a theoretical foundation and technical pathway for building robust, efficient, and scalable next-generation AI evaluation frameworks.
📝 Abstract
As AI systems continue to evolve, their rigorous evaluation becomes crucial for their development and deployment. Researchers have constructed various large-scale benchmarks to determine their capabilities, typically evaluating against a gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high evaluation costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this Position, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics or value of each test item in the benchmark, and tailoring each model's evaluation instead of relying on a fixed test set. This paradigm provides robust ability estimation, uncovering the latent traits underlying a model's observed scores. This position paper analyzes the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation. We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
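The adaptive-testing paradigm the abstract describes can be illustrated with a two-parameter IRT model: each item carries a discrimination parameter a and a difficulty parameter b, the tester repeatedly administers the item most informative at the current ability estimate, and re-estimates the model's latent ability θ after each response. Below is a minimal, dependency-free sketch under those assumptions; the function names, grid-based estimator, and toy item bank are illustrative, not taken from the paper.

```python
import math
import random

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers
    an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Fisher information of an item at ability theta; adaptive
    testing picks the most informative unadministered item."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_theta(responses):
    """Maximum-likelihood ability estimate over a coarse grid of
    candidate theta values in [-4, 4] (keeps the sketch simple)."""
    grid = [i / 10.0 for i in range(-40, 41)]
    def log_lik(theta):
        ll = 0.0
        for (a, b), correct in responses:
            p = p_correct(theta, a, b)
            ll += math.log(p if correct else 1.0 - p)
        return ll
    return max(grid, key=log_lik)

def adaptive_test(item_bank, oracle, n_items=10):
    """Administer n_items adaptively: start at theta = 0, always ask
    the currently most informative item, re-estimate theta each time."""
    theta, responses, remaining = 0.0, [], list(item_bank)
    for _ in range(n_items):
        item = max(remaining, key=lambda ab: fisher_information(theta, *ab))
        remaining.remove(item)
        responses.append((item, oracle(item)))
        theta = estimate_theta(responses)
    return theta

# Toy demo: a simulated model with true ability 1.2 answers
# stochastically according to the 2PL model itself.
random.seed(0)
bank = [(random.uniform(0.5, 2.0), random.uniform(-3, 3)) for _ in range(200)]
true_theta = 1.2
oracle = lambda ab: random.random() < p_correct(true_theta, *ab)
print(adaptive_test(bank, oracle, n_items=30))
```

In a real evaluation the oracle would be replaced by running the AI model on the selected benchmark items, and item parameters would themselves be calibrated from response data rather than drawn at random.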
Problem

Research questions and friction points this paper is trying to address.

Static AI evaluation methods have high costs and reliability issues
Adaptive testing can improve AI evaluation robustness and efficiency
Psychometrics offers solutions for modern AI assessment challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shift from static to adaptive testing paradigm
Estimate test item characteristics for tailored evaluation
Apply psychometrics theory to AI evaluation challenges
Yan Zhuang
University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence
Qi Liu
University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence
Yuting Ning
The Ohio State University
Natural Language Processing
Wei Huang
University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence
Rui Lv
University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence
Zhenya Huang
University of Science and Technology of China
Data Science, AI, Knowledge Representation, Cognitive Reasoning, Intelligent Education
Guanhao Zhao
University of Science and Technology of China
Data Mining, Diffusion Model, Computerized Adaptive Testing
Zheng Zhang
University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence
Qingyang Mao
University of Science and Technology of China
Table Reasoning, Cross-domain Transfer Learning, Visual Generation
Shijin Wang
Tongji University
Scheduling, Maintenance
Enhong Chen
University of Science and Technology of China
Data Mining, Recommender Systems, Machine Learning