🤖 AI Summary
Current AI evaluation faces challenges including high annotation costs, contamination between training and evaluation data, and low-quality test items, all of which undermine reliability and validity. This paper pioneers the systematic integration of classical psychometrics, particularly Item Response Theory (IRT), into AI capability assessment, proposing an IRT-based adaptive testing paradigm. By jointly modeling item parameters and model abilities, the paradigm enables dynamic item selection, personalized measurement, and interpretable evaluation. Unlike static benchmark suites, it supports real-time ability estimation and online calibration of item parameters. Experiments demonstrate that, compared with conventional benchmarks, the paradigm reduces annotation costs by over 40%, mitigates contamination between training and evaluation data, and significantly improves both reliability and validity. This work establishes a theoretical foundation and a technical pathway toward robust, efficient, and scalable next-generation AI evaluation frameworks.
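As a point of reference for "jointly modeling item parameters and model abilities": the two-parameter logistic (2PL) model is a standard IRT formulation for this kind of joint modeling. Whether the paper uses the 2PL variant specifically is an assumption here; the notation below is illustrative.

$$
P_j(\theta_i) \;=\; \Pr(u_{ij} = 1 \mid \theta_i) \;=\; \frac{1}{1 + e^{-a_j(\theta_i - b_j)}},
$$

where $\theta_i$ is the latent ability of model $i$, and $a_j$ and $b_j$ are the discrimination and difficulty of item $j$. Dynamic item selection then typically administers the item that maximizes the Fisher information $I_j(\theta) = a_j^2\, P_j(\theta)\,\bigl(1 - P_j(\theta)\bigr)$ at the current ability estimate, which is why an adaptive test can reach a stable estimate with far fewer items than a fixed benchmark.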
📝 Abstract
As AI systems continue to evolve, rigorous evaluation becomes crucial to their development and deployment. Researchers have constructed various large-scale benchmarks to measure these systems' capabilities, typically scoring models against a gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high evaluation costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this position paper, drawing on human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics or value of each test item in the benchmark and tailoring each model's evaluation, instead of relying on a fixed test set. This paradigm provides robust ability estimation, uncovering the latent traits underlying a model's observed scores. We analyze the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation. We argue that psychometrics, a theory developed in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluation.
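To make "tailoring each model's evaluation" concrete, here is a minimal sketch of an IRT-driven adaptive testing loop under the 2PL model shown above. It assumes a pre-calibrated item bank; the function names, the grid-based maximum-likelihood ability estimator, and the maximum-information selection rule are illustrative choices, not the paper's implementation.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Fisher information the item carries about theta under the 2PL model."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1.0 - p)

def estimate_theta(responses, grid=np.linspace(-4, 4, 801)):
    """Grid-based maximum-likelihood ability estimate.
    `responses` is a list of (a, b, correct) tuples observed so far."""
    log_lik = np.zeros_like(grid)
    for a, b, correct in responses:
        p = p_correct(grid, a, b)
        log_lik += np.log(p if correct else 1.0 - p)
    return grid[np.argmax(log_lik)]

def adaptive_test(item_bank, oracle, n_items=20):
    """Adaptive loop: repeatedly administer the unused item that is most
    informative at the current ability estimate, then re-estimate theta."""
    theta, responses, used = 0.0, [], set()
    for _ in range(n_items):
        # Select the remaining item with maximum Fisher information.
        idx = max((i for i in range(len(item_bank)) if i not in used),
                  key=lambda i: fisher_information(theta, *item_bank[i]))
        used.add(idx)
        a, b = item_bank[idx]
        responses.append((a, b, oracle(idx)))  # oracle: did the model answer item idx correctly?
        theta = estimate_theta(responses)
    return theta

# Demo with a synthetic item bank and a simulated model of true ability 1.2.
rng = np.random.default_rng(0)
bank = [(rng.uniform(0.5, 2.0), rng.normal()) for _ in range(200)]
oracle = lambda i: rng.random() < p_correct(1.2, *bank[i])
print(adaptive_test(bank, oracle))  # estimate should land near 1.2
```

In this sketch, 20 adaptively chosen items stand in for the full 200-item bank, which is the mechanism behind the cost reductions the summary describes: each model sees only the items that are informative at its current ability level.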