AI Summary
Existing AI evaluation methods lack interpretability and fail to predict general-purpose AI performance on novel tasks. This paper introduces the first generalizable, non-saturating capability scaling framework for general AI, built upon 18 automated assessment rubrics. It constructs dual spectra (task demands and model abilities), enabling multidimensional capability disentanglement (knowledge, metacognition, reasoning) and unsupervised capability profiling. Crucially, it enables instance-level performance prediction across tasks and data distributions, a first, and reveals the sensitivity and specificity mechanisms underlying benchmark design. Experiments across 15 large language models and 63 diverse tasks demonstrate that the method reduces out-of-distribution (OOD) instance-level prediction error by 37%, significantly outperforming embedding- and fine-tuning-based baselines, and substantially enhances both the interpretability and generalizability of AI evaluation.
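To make the dual-spectrum idea concrete, the sketch below shows one way a model's ability profile could be read off from graded instance results, once each instance is annotated with per-dimension demand levels. The three dimension names, the 1-5 demand scale, the 50% success threshold, and the `ability_profile` helper are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: reading an ability profile off graded, demand-annotated results.
# DIMENSIONS, the 1-5 scale, and the 0.5 threshold are illustrative assumptions.
from collections import defaultdict

DIMENSIONS = ["knowledge", "metacognition", "reasoning"]  # assumed subset of the 18 scales

def ability_profile(results, max_level=5, threshold=0.5):
    """results: iterable of (demands: dict[str, int], success: bool), one per instance.
    Returns, per dimension, the highest demand level at which the model's
    measured success rate still meets `threshold`."""
    buckets = defaultdict(lambda: defaultdict(list))  # dim -> demand level -> [success]
    for demands, success in results:
        for dim in DIMENSIONS:
            buckets[dim][demands[dim]].append(success)
    profile = {}
    for dim in DIMENSIONS:
        ability = 0
        for level in range(1, max_level + 1):
            outcomes = buckets[dim].get(level, [])
            if outcomes and sum(outcomes) / len(outcomes) >= threshold:
                ability = level  # highest level still cleared at the threshold
        profile[dim] = ability
    return profile
```

Reading ability as the highest demand level a model still clears mirrors how difficulty-graded scales are interpreted in psychometrics; the paper's actual estimator may differ.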
Abstract
Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. Benchmarking has so far guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems, given the low transferability across diverse tasks. In this paper, we introduce general scales for AI evaluation that can explain what common AI benchmarks really measure, extract ability profiles of AI systems, and predict their performance on new task instances, in- and out-of-distribution. Our fully automated methodology builds on 18 newly crafted rubrics that place instance demands on general scales that do not saturate. Illustrated for 15 large language models and 63 tasks, the framework delivers high explanatory power: inspecting demand and ability profiles brings insights into the sensitivity and specificity exhibited by different benchmarks, and into how knowledge, metacognition and reasoning are affected by model size, chain-of-thought and distillation. Surprisingly, these demand levels also yield high predictive power at the instance level, providing superior estimates over black-box baseline predictors based on embeddings or fine-tuning, especially in out-of-distribution settings (new tasks and new benchmarks). The scales, rubrics, battery, techniques and results presented here represent a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead.
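As a rough illustration of how demand and ability profiles could drive instance-level prediction, here is a minimal sketch assuming a logistic link over the demand-ability gap on the binding dimension; the sigmoid form, the slope `k`, and the `predict_success` helper are assumptions for illustration, not the paper's published predictor.

```python
# Hedged sketch: instance-level success prediction from the gap between a
# model's ability profile and an instance's demand levels. The logistic
# form and slope k are assumptions, not the paper's published predictor.
import math

def predict_success(ability, demands, k=1.5):
    """ability, demands: dicts mapping dimension -> level on a shared scale.
    The instance is gated by the binding dimension, i.e. the one where
    demand most exceeds ability."""
    gap = min(ability[d] - demands[d] for d in demands)
    return 1.0 / (1.0 + math.exp(-k * gap))

# Usage: a model strong in knowledge but weak in reasoning, facing a
# reasoning-heavy instance; failure is predicted as likely.
ability = {"knowledge": 4, "metacognition": 3, "reasoning": 2}
demands = {"knowledge": 2, "metacognition": 1, "reasoning": 4}
print(round(predict_success(ability, demands), 3))  # 0.047
```

A predictor of this shape needs only the instance's demand annotations at test time, which is what would let it transfer to new tasks and benchmarks where no per-model performance data exists.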