Look Before You Leap: Estimating LLM Benchmark Scores from Descriptions

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating large language models (LLMs) incurs prohibitive experimental costs, especially when benchmarking across diverse tasks and configurations. Method: We propose *pre-experimental performance prediction*—a novel paradigm that forecasts LLM benchmark scores solely from task descriptions and model configurations, without accessing ground-truth data instances. To support this, we construct PRECOG, the first data-leakage-free, multi-task, multi-domain corpus linking task descriptions to empirical performance. We design a text-driven prediction framework integrating systematic task encoding, retrieval-augmented evidence aggregation, and uncertainty calibration. Results: Our method achieves a mean absolute error as low as 8.7 on the Accuracy subset for high-confidence predictions. Notably, even under zero-leakage conditions—i.e., for entirely unseen, unpublished benchmarks—GPT-5 augmented with web search retains nontrivial predictive capability. This work enables proactive evaluation and experiment prioritization, establishing a new pathway toward efficient, scalable LLM assessment.

📝 Abstract
Progress in large language models is constrained by an evaluation bottleneck: build a benchmark, evaluate models and settings, then iterate. We therefore ask a simple question: can we forecast outcomes before running any experiments? We study text-only performance forecasting: estimating a model's score from a redacted task description and intended configuration, with no access to dataset instances. To support systematic study, we curate PRECOG, a corpus of redacted description-performance pairs spanning diverse tasks, domains, and metrics. Experiments show the task is challenging but feasible: models equipped with a retrieval module that excludes source papers achieve moderate prediction performance with well-calibrated uncertainty, reaching mean absolute error as low as 8.7 on the Accuracy subset at high-confidence thresholds. Our analysis indicates that stronger reasoning models engage in diverse, iterative querying, whereas current open-source models lag and often skip retrieval or gather evidence with limited diversity. We further test a zero-leakage setting, forecasting on newly released datasets or experiments before their papers are indexed, where GPT-5 with built-in web search still attains nontrivial prediction accuracy. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter experiment prioritization.
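The "high-confidence thresholds" figure above can be read as a selective-prediction metric: score only the forecasts whose self-reported confidence clears a threshold, and report error plus coverage on that subset. A minimal sketch of that evaluation, assuming scores on a 0–100 scale; all names and numbers here are illustrative, not taken from the paper:

```python
def selective_mae(preds, targets, confidences, threshold):
    """Mean absolute error over predictions with confidence >= threshold.

    Returns (mae, coverage); mae is None when no prediction clears the bar.
    """
    kept = [
        (p, t)
        for p, t, c in zip(preds, targets, confidences)
        if c >= threshold
    ]
    if not kept:
        return None, 0.0
    mae = sum(abs(p - t) for p, t in kept) / len(kept)
    coverage = len(kept) / len(preds)
    return mae, coverage

# Toy usage: three forecasts of benchmark accuracy (0-100 scale).
preds = [72.0, 55.0, 90.0]
targets = [80.0, 50.0, 88.0]
confs = [0.9, 0.4, 0.8]
mae, cov = selective_mae(preds, targets, confs, threshold=0.7)  # keeps 2 of 3
```

Raising the threshold trades coverage for accuracy, which is the pattern the abstract reports: the 8.7 MAE holds on the high-confidence slice, not over all predictions.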
Problem

Research questions and friction points this paper is trying to address.

Forecasting LLM benchmark scores before running experiments
Estimating model performance from task descriptions without datasets
Addressing evaluation bottlenecks through predictive performance estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-only performance forecasting from redacted task descriptions
Retrieval module excluding source papers for benchmark prediction
Zero-leakage forecasting on newly released datasets before indexing