From Human-Level AI Tales to AI Leveling Human Scales

📅 2026-02-21
🤖 AI Summary
Current AI models are often inaccurately described as achieving “human-level” performance due to the absence of evaluation benchmarks grounded in the global distribution of human capabilities. This work proposes a human-anchored evaluation framework that leverages large-scale international assessments—such as PISA and TIMSS—and integrates stratified sampling with post-stratification techniques to extrapolate ability distributions across diverse populations. The framework constructs a multidimensional logit scale, enabling, for the first time, direct comparison of AI performance against the full spectrum of human abilities on a unified, recalibratable metric. Consequently, AI performance can be objectively expressed as the log-odds corresponding to the probability of success among the global human population, establishing a standardized, cross-task, and cross-population evaluation benchmark.
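
The log-odds expression mentioned in the summary can be illustrated with a minimal sketch. The function names and the clipping constant `eps` below are hypothetical and are not taken from the paper; the code only shows the standard logit transform between a success probability and its log-odds.

```python
import math


def success_log_odds(p: float, eps: float = 1e-9) -> float:
    """Return the natural-log odds of a success probability p.

    p is clipped to [eps, 1 - eps] so that 0 and 1 do not produce
    infinities. The clipping constant is an illustrative assumption.
    """
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def success_probability(log_odds: float) -> float:
    """Inverse transform: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical example: a task that 20% of a population would solve.
example_log_odds = success_log_odds(0.2)
```

A probability of 0.5 maps to log-odds of 0, so under this transform tasks that fewer than half of the population would solve receive negative values and easier tasks receive positive ones.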

📝 Abstract
Comparing AI models to "human level" is often misleading when benchmark scores are incommensurate or human baselines are drawn from a narrow population. To address this, we propose a framework that calibrates items against the "world population" and reports performance on a common, human-anchored scale. Concretely, we build on a set of multi-level scales for different capabilities, where each level should represent a probability of success of the whole world population on a logarithmic scale with base $B$. We calibrate each scale for each capability (reasoning, comprehension, knowledge, volume, etc.) by compiling publicly released human test data spanning education and reasoning benchmarks (PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench). The base $B$ is estimated by extrapolating between samples with two demographic profiles using LLMs, under the hypothesis that LLMs condense rich information about human populations. We evaluate the quality of different mappings using group slicing and post-stratification. The new techniques allow for the recalibration and standardization of scales relative to the whole-world population.
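
As a rough illustration of two ideas in the abstract, the sketch below reweights per-group success rates by population shares (post-stratification) and then places the resulting probability on a logarithmic scale with base $B$. The abstract gives no explicit formula, so the convention `p = B ** (-level)`, the group names, the numbers, and the placeholder base are all assumptions made for illustration, not the authors' method or data.

```python
import math


def post_stratified_rate(sample_rates: dict, population_shares: dict) -> float:
    """Population-weighted average of per-group success rates.

    sample_rates: success rate observed in each demographic group.
    population_shares: share of the target population in each group.
    Shares are renormalised over the groups present in sample_rates.
    """
    total_share = sum(population_shares[g] for g in sample_rates)
    weighted = sum(sample_rates[g] * population_shares[g] for g in sample_rates)
    return weighted / total_share


def level_from_probability(p: float, base: float) -> float:
    """Level on a log scale, ASSUMING the convention p = base ** (-level)."""
    return -math.log(p) / math.log(base)


# Hypothetical numbers: the test sample over-represents group_a.
rates = {"group_a": 0.8, "group_b": 0.2}
shares = {"group_a": 0.25, "group_b": 0.75}

world_rate = post_stratified_rate(rates, shares)    # 0.8*0.25 + 0.2*0.75
level = level_from_probability(world_rate, base=10.0)  # base is a placeholder
```

Under the assumed convention, each additional level corresponds to a success probability that is smaller by a factor of $B$, which is why the choice of base matters for how the scale reads.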
Problem

Research questions and friction points this paper is trying to address.

human-level AI
benchmarking
population calibration
scale standardization
AI evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

human-anchored scale
population calibration
logarithmic proficiency scale
large language models
post-stratification
Peter Romero
Universitat Politècnica de València, University of Cambridge
People Analytics, Psychometrics, Deep Learning, Algebraic Topology, Cybernetics
Fernando Martínez-Plumed
VRAIN, Valencian Research Institute for Artificial Intelligence, Universitat Politecnica de Valencia
Artificial Intelligence, Machine Learning, AI evaluation, Item Response Theory
Zachary R. Tyler
Georgia Institute of Technology
Matthieu Téhénan
University of Cambridge, Department of Computer Science and Technology
Sipeng Chen
Florida State University
Machine Learning
Álvaro David Gómez Antón
Valencian Research Institute of Artificial Intelligence, Universitat Politècnica de València, Valencia, Spain
Luning Sun
Lawrence Livermore National Lab
AI for Science, Scientific Machine Learning, Uncertainty Quantification, CFD, Variational Inference
Manuel Cebrian
Spanish National Research Council
Computational Social Science, Artificial Intelligence
Lexin Zhou
Department of Computer Science, Princeton University
Yael Moros Daval
Valencian Research Institute of Artificial Intelligence, Universitat Politècnica de València, Valencia, Spain
Daniel Romero-Alvarado
Valencian Research Institute of Artificial Intelligence, Universitat Politècnica de València, Valencia, Spain
Félix Martí Pérez
Valencian Research Institute of Artificial Intelligence, Universitat Politècnica de València, Valencia, Spain
Kevin Wei
Assistant Professor of Medicine, Harvard Medical School, Brigham and Women's Hospital
inflammation, fibroblast, stromal cells, single-cell genomics
José Hernández-Orallo
University of Cambridge, VRAIN-UPV
Artificial Intelligence, Data Science, Intelligence, AI Evaluation, AI Safety