General Scales Unlock AI Evaluation with Explanatory and Predictive Power

📅 2025-03-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing AI evaluation methods lack interpretability and fail to predict general-purpose AI performance on novel tasks. This paper introduces the first generalizable, non-saturating capability scaling framework for general AI, built upon 18 automated assessment metrics. It constructs dual spectra—task requirements and model capabilities—enabling multidimensional capability disentanglement (knowledge, metacognition, reasoning) and unsupervised capability profiling. Crucially, it enables instance-level performance prediction across tasks and data distributions—a first—and reveals the sensitivity and specificity mechanisms underlying benchmark design. Experiments across 15 large language models and 63 diverse tasks demonstrate that our method reduces out-of-distribution (OOD) instance-level prediction error by 37%, significantly outperforming embedding- and fine-tuning-based baselines. It substantially enhances both the interpretability and generalizability of AI evaluation.

📝 Abstract
Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems, given the low transferability across diverse tasks. In this paper, we introduce general scales for AI evaluation that can explain what common AI benchmarks really measure, extract ability profiles of AI systems, and predict their performance for new task instances, in- and out-of-distribution. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate. Illustrated for 15 large language models and 63 tasks, high explanatory power is unleashed from inspecting the demand and ability profiles, bringing insights on the sensitivity and specificity exhibited by different benchmarks, and how knowledge, metacognition and reasoning are affected by model size, chain-of-thought and distillation. Surprisingly, high predictive power at the instance level becomes possible using these demand levels, providing superior estimates over black-box baseline predictors based on embeddings or finetuning, especially in out-of-distribution settings (new tasks and new benchmarks). The scales, rubrics, battery, techniques and results presented here represent a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead.
Problem

Research questions and friction points this paper is trying to address.

Develop general scales for AI evaluation
Predict AI performance on new tasks
Enhance explanatory power of AI benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

General scales for AI evaluation
Automated methodology with 18 rubrics
Predictive power for new tasks
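The summary's core mechanism, predicting instance-level success by comparing a task instance's demand profile against a model's ability profile, can be illustrated with a toy sketch. This is not the paper's actual estimator; the profiles, the 0-10 scale, the worst-margin rule, and the logistic link are all illustrative assumptions.

```python
import math

def predict_success(ability: dict, demand: dict, slope: float = 1.5) -> float:
    """Toy instance-level predictor: success probability shrinks as any
    demand dimension exceeds the model's ability on that dimension.
    A logistic link is applied to the worst (ability - demand) margin."""
    margin = min(ability[dim] - demand[dim] for dim in demand)
    return 1.0 / (1.0 + math.exp(-slope * margin))

# Hypothetical 0-10 profiles over three of the paper's capability dimensions
model_abilities = {"knowledge": 7.0, "reasoning": 5.0, "metacognition": 4.0}
instance_demands = {"knowledge": 6.0, "reasoning": 6.5, "metacognition": 3.0}

# Reasoning demand (6.5) exceeds ability (5.0), so predicted success is low
p = predict_success(model_abilities, instance_demands)
```

Because the prediction is driven by named demand dimensions rather than opaque embeddings, a low predicted score is directly attributable to the dimension where demand outstrips ability, which is the interpretability gain the summary highlights.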
Lexin Zhou
Leverhulme Centre for the Future of Intelligence, University of Cambridge, UK; Microsoft Research Asia; Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Spain
Lorenzo Pacchiardi
Research Associate, University of Cambridge
Large Language Models, AI evaluation, AI policy, Bayesian Inference, Likelihood-Free Inference
Fernando Martínez-Plumed
VRAIN, Valencian Research Institute for Artificial Intelligence, Universitat Politècnica de València
Artificial Intelligence, Machine Learning, AI evaluation, Item Response Theory
Katherine M. Collins
Machine Learning PhD Student at the University of Cambridge
Cognitive Science, Machine Learning, Bayesian Statistics, Human-AI Interaction
Yael Moros-Daval
Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Spain
Seraphina Zhang
Leverhulme Centre for the Future of Intelligence, University of Cambridge, UK; Department of Psychology, University of Cambridge, UK
Qinlin Zhao
Microsoft Research Asia
Yitian Huang
Microsoft Research Asia
Luning Sun
Lawrence Livermore National Lab
AI for Science, Scientific Machine Learning, Uncertainty Quantification, CFD, Variational Inference
Jonathan E. Prunty
Leverhulme Centre for the Future of Intelligence, University of Cambridge, UK
Zongqian Li
Department of Theoretical and Applied Linguistics, University of Cambridge, UK
Pablo Sánchez-García
KU Leuven, Belgium
Kexin Jiang Chen
Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Spain
Pablo A. M. Casares
Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Spain
Jiyun Zu
Educational Testing Service, US
John Burden
University of Cambridge
Reinforcement Learning, Artificial Intelligence, Long-term AI Safety, AI Evaluation
Behzad Mehrbakhsh
Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Spain
David Stillwell
Cambridge University
Psychology, social networks, decision making, digital footprints
Manuel Cebrian
Spanish National Research Council
Computational Social Science, Artificial Intelligence
Jindong Wang
Assistant Professor, William & Mary; Ex Senior Researcher, Microsoft Research
machine learning, transfer learning, large language models, generative AI
Peter Henderson
Princeton University
Machine Learning, Law
Sherry Tongshuang Wu
Assistant Professor @ Carnegie Mellon University
Human-AI Interaction, Human Computer Interaction, Natural Language Processing
Patrick C. Kyllonen
Educational Testing Service, US
Lucy Cheke
Professor of Experimental Psychology, Department of Psychology, Cambridge
Episodic Memory, Memory Development, Memory impairment, Comparative Cognition, Cognition in AI
Xing Xie
Microsoft Research Asia
José Hernández-Orallo
University of Cambridge, VRAIN-UPV
Artificial Intelligence, Data Science, Intelligence, AI Evaluation, AI Safety