Cost-Efficient Estimation of General Abilities Across Benchmarks

📅 2026-04-01
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the challenge of efficiently predicting the performance of large language models on unseen tasks under limited evaluation budgets. The authors argue that predictive validity, i.e., how well a benchmarking framework predicts model performance on unseen tasks, should be the primary criterion for comparing evaluation methods. Their approach combines a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design, augmented with a cost-aware discount factor in the selection criterion. Leveraging WILD, the newly collected Wide-scale Item Level Dataset of 65 models' responses to 109,564 items across 163 tasks, the method predicts performance on 112 held-out tasks with a mean absolute error below 7% after observing only 16 items. Incorporating the cost-aware discount further reduces the tokens needed to reach 7% MAE from 141,000 to 22,000, an 85% reduction in evaluation cost.
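As a rough illustration of the multidimensional IRT component mentioned in the summary, below is a minimal sketch of a 2PL-style response model: a model's latent ability vector and an item's discrimination vector and difficulty determine the probability of a correct answer, and a task's predicted score is the mean over its items. The function names, parameterization, and example values are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal multidimensional IRT sketch (2PL-style): P(correct) = sigmoid(a . theta - b).
# Illustrative only; the paper's modified IRT model may use a different parameterization.
import numpy as np

def irt_correct_prob(theta, a, b):
    """Probability of a correct response for ability vector theta, item discrimination a, difficulty b."""
    logit = float(np.dot(a, theta) - b)
    return 1.0 / (1.0 + np.exp(-logit))

def predict_task_score(theta, items):
    """Predicted task accuracy = mean predicted P(correct) over the task's items."""
    return float(np.mean([irt_correct_prob(theta, a, b) for a, b in items]))

# Example: a 3-dimensional ability vector and two items (discrimination vector, difficulty).
theta = np.array([0.8, -0.2, 1.1])
items = [(np.array([1.0, 0.3, 0.0]), 0.5),
         (np.array([0.2, 0.9, 0.4]), -0.3)]
print(predict_task_score(theta, items))
```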
📝 Abstract
Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the "Wide-scale Item Level Dataset" (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model's performance on a large, diverse collection of unseen tasks under different budget constraints. We demonstrate that combining a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design can predict performance on 112 held-out benchmark tasks with a mean absolute error (MAE) of less than 7%, and can do so after observing only 16 items. We further demonstrate that incorporating cost-aware discount factors into our selection criteria can reduce the total tokens needed to reach 7% MAE from 141,000 tokens to only 22,000, an 85% reduction in evaluation cost.
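The abstract describes adaptive item selection driven by optimal experimental design, with a cost-aware discount that trades information gain against token cost. The sketch below shows one plausible form of such a criterion: pick the item whose Fisher information contribution most increases a D-optimality-style objective, penalized by the item's token count. The exact objective and discount used in the paper are not specified here, so the functions, the log-cost penalty, and the example numbers are assumptions.

```python
# Sketch of cost-aware adaptive item selection at the current ability estimate theta.
# Illustrative only; the paper's selection rule and discount factor may differ.
import numpy as np

def item_fisher_info(theta, a, b):
    """Fisher information contribution of one item under the 2PL-style model above."""
    p = 1.0 / (1.0 + np.exp(-(float(np.dot(a, theta)) - b)))
    return p * (1.0 - p) * np.outer(a, a)

def select_next_item(theta, candidates, info_so_far, cost_weight=1.0):
    """Pick the candidate maximizing log-det information gain minus a token-cost penalty.

    candidates: list of (a, b, n_tokens) tuples; info_so_far: running information matrix.
    """
    best_idx, best_score = None, -np.inf
    ridge = 1e-6 * np.eye(len(theta))
    for idx, (a, b, n_tokens) in enumerate(candidates):
        info_new = info_so_far + item_fisher_info(theta, a, b)
        gain = np.linalg.slogdet(info_new + ridge)[1]   # D-optimality-style objective
        score = gain - cost_weight * np.log(n_tokens)   # cost-aware discount
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx

# Example: choose between a cheap low-information item and a costly high-information one.
theta = np.array([0.8, -0.2, 1.1])
candidates = [(np.array([0.2, 0.1, 0.0]), 0.0, 50),
              (np.array([1.0, 0.8, 0.5]), 0.2, 900)]
print(select_next_item(theta, candidates, info_so_far=np.eye(3)))
```

With `cost_weight=0` this reduces to a plain information-maximizing (budget-agnostic) selector; raising the weight shifts selection toward shorter items, which is the intuition behind the 141,000-to-22,000 token reduction reported above.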
Problem

Research questions and friction points this paper is trying to address.

cost-efficient benchmarking
large language models
performance prediction
unseen tasks
evaluation cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

multidimensional item response theory
adaptive item selection
optimal experimental design
cost-efficient evaluation
predictive validity