Forecasting Downstream Performance of LLMs With Proxy Metrics

📅 2026-05-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the lack of efficient, reliable methods for predicting the downstream performance of language models: cross-entropy loss aligns poorly with actual capabilities, while direct evaluation is computationally expensive and sparse. The study proposes constructing lightweight proxy metrics by aggregating token-level statistics (such as entropy, top-k accuracy, and expert token rank) from a candidate model's next-token distribution over expert-written text. The proxies outperform loss- and compute-based baselines in three settings: in cross-family model selection they reach a mean Spearman correlation of 0.81 (versus 0.36 for cross-entropy loss), in pretraining data selection they rank candidate corpora at roughly 10⁴ times less compute than direct evaluation, and in training-time forecasting they extrapolate downstream accuracy with roughly half the error of existing alternatives.
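To make the model-selection numbers concrete, the sketch below shows how a Spearman rank correlation between a proxy score and downstream accuracy can be computed. It is a minimal standard-library illustration, not the paper's evaluation code; the function names `average_ranks` and `spearman_rho` and the toy scores are assumptions introduced here.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))


# Toy, made-up values for four hypothetical models (not results from the paper).
proxy_scores = [0.42, 0.55, 0.61, 0.70]
downstream_accuracy = [0.31, 0.38, 0.36, 0.52]
rho = spearman_rho(proxy_scores, downstream_accuracy)
```

A value near 1 means the proxy orders the models almost the same way the downstream benchmark does, which is the property that matters when the proxy is used only to choose among candidates.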
📝 Abstract
Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next-token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) for cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman ρ = 0.81 (vs. ρ = 0.36 for cross-entropy loss); 2) for pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly 10,000× less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an 18× compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
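The token-level statistics named in the abstract can be illustrated with a small sketch. It assumes that, for each position in an expert-written solution, we already have the candidate model's next-token logits and the index of the token the expert actually wrote. The function names, the plain mean used for aggregation, and the toy logits are assumptions made for illustration; the abstract does not specify how the paper aggregates these statistics.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def token_stats(logits, expert_id, k=5):
    """Statistics for one position, given the token the expert wrote."""
    log_probs = log_softmax(logits)
    # Shannon entropy (in nats) of the next-token distribution.
    entropy = -sum(math.exp(lp) * lp for lp in log_probs)
    # Rank 1 means the expert token has the highest logit (ties count in its favor).
    rank = 1 + sum(1 for v in logits if v > logits[expert_id])
    return {
        "nll": -log_probs[expert_id],
        "entropy": entropy,
        "rank": rank,
        "top_k_hit": rank <= k,
    }


def aggregate_proxy_features(all_logits, expert_ids, k=5):
    """Average per-token statistics over an expert trajectory (simple mean, an assumption)."""
    stats = [token_stats(l, e, k) for l, e in zip(all_logits, expert_ids)]
    n = len(stats)
    return {
        "cross_entropy": sum(s["nll"] for s in stats) / n,
        "mean_entropy": sum(s["entropy"] for s in stats) / n,
        "top_k_accuracy": sum(s["top_k_hit"] for s in stats) / n,
        "mean_expert_rank": sum(s["rank"] for s in stats) / n,
    }


# Toy example: two positions, a three-token vocabulary, made-up logits.
features = aggregate_proxy_features(
    all_logits=[[2.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
    expert_ids=[1, 2],
    k=1,
)
```

Cross-entropy is included only as the baseline the other statistics are compared against. Features like these could then be fed to whatever predictor maps them to downstream accuracy, a step the abstract does not detail.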
Problem

Research questions and friction points this paper is trying to address.

downstream performance forecasting
large language models
proxy metrics
training efficiency
model evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

proxy metrics
token-level statistics
downstream performance forecasting
expert trajectories
large language models