🤖 AI Summary
Current AI benchmarking over-relies on increasingly costly models, and so fails to reflect genuine progress in capability per unit cost. Method: We construct the first large-scale dataset of historical and contemporary AI benchmark prices by integrating Artificial Analysis and Epoch AI data, controlling for the confounding effects of open-source model proliferation and hardware price declines, to quantify how inference cost per unit of performance has evolved across knowledge, reasoning, mathematics, and software engineering tasks. Results: State-of-the-art models exhibit 5–10× annual reductions in inference cost, with algorithmic efficiency improvements contributing ~3×/year, exceeding the gains from hardware and pricing. We therefore propose “benchmark cost” as a core metric for assessing AI’s real-world impact. This work systematically isolates and quantifies algorithmic efficiency as a dominant driver of falling AI inference costs, reframing how AI capability progress is measured.
📝 Abstract
Language models have seen enormous progress on advanced benchmarks in recent years, but much of this progress has only been possible by using more costly models. Benchmarks may therefore present a warped picture of progress in practical capabilities per dollar. To remedy this, we use data from Artificial Analysis and Epoch AI to form the largest dataset to date of current and historical prices to run benchmarks. We find that the price for a given level of benchmark performance has decreased remarkably fast, around $5\times$ to $10\times$ per year, for frontier models on knowledge, reasoning, math, and software engineering benchmarks. These reductions in the cost of AI inference are due to economic forces, hardware efficiency improvements, and algorithmic efficiency improvements. Restricting to open models to control for competition effects, and dividing out hardware price declines, we estimate that algorithmic efficiency progress is around $3\times$ per year. Finally, we recommend that evaluators both publicize and take into account the price of benchmarking as an essential part of measuring the real-world impact of AI.
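The decomposition described above (annualize the total price decline, then divide out the hardware component to leave the algorithmic residual) can be sketched as follows. This is a minimal illustration with hypothetical prices and an assumed hardware-decline factor, not the paper's actual data or estimation procedure:

```python
def annual_decline_factor(price_start: float, price_end: float, years: float) -> float:
    """Annualized multiplicative price-decline factor.

    A return value of 10.0 means the price to reach a fixed benchmark
    score fell 10x per year on average over the interval.
    """
    return (price_start / price_end) ** (1.0 / years)

# Hypothetical numbers for illustration only: a benchmark run that cost
# $60 at a fixed performance level in 2023 and $0.60 two years later.
total = annual_decline_factor(60.0, 0.60, 2.0)  # 10x per year overall

# Assumed hardware price-performance improvement (illustrative value).
# Dividing it out of the total decline leaves the residual attributable
# to algorithmic efficiency and other non-hardware factors.
hardware = 3.3
algorithmic = total / hardware  # roughly 3x per year under these assumptions
```

The decomposition is multiplicative because the cost factors compound: a model that is $k$ times cheaper to serve per FLOP and needs $m$ times fewer FLOPs per query is $k \cdot m$ times cheaper overall.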