🤖 AI Summary
Evaluating large language models (LLMs) incurs prohibitive experimental costs, especially when benchmarking across diverse tasks and configurations.
Method: We propose *pre-experimental performance prediction*—a novel paradigm that forecasts LLM benchmark scores solely from task descriptions and model configurations, without accessing ground-truth data instances. To support this, we construct PRECOG, the first data-leakage-free, multi-task, multi-domain corpus linking task descriptions to empirical performance. We design a text-driven prediction framework integrating systematic task encoding, retrieval-augmented evidence aggregation, and uncertainty calibration.
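The framework above can be sketched as a minimal pipeline: encode the task description and configuration, retrieve supporting evidence (with the task's source paper excluded), and emit a score forecast with a calibrated confidence. All names, signatures, and the toy heuristics below are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    description: str   # redacted task description (no dataset instances)
    model_name: str    # intended model configuration
    metric: str        # e.g., "Accuracy"

def retrieve_evidence(spec: TaskSpec) -> list[str]:
    # Stand-in for retrieval that excludes the source paper; a real
    # system would issue diverse, iterative queries over the web/corpus.
    return [f"reported {spec.metric} on tasks similar to: "
            f"{spec.description[:40]}"]

def predict_score(spec: TaskSpec, evidence: list[str]) -> tuple[float, float]:
    # Stand-in for the LLM predictor: returns (forecast score, confidence).
    # Here a fixed prior, with confidence scaled by evidence count.
    prior = 50.0
    confidence = min(1.0, 0.5 + 0.1 * len(evidence))
    return prior, confidence

spec = TaskSpec("Classify customer emails by intent", "gpt-5", "Accuracy")
score, conf = predict_score(spec, retrieve_evidence(spec))
```

In the real framework the predictor and retriever are a reasoning LLM with a search tool; the skeleton only fixes the data flow from description to (score, confidence) pair.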
Results: Our method achieves a mean absolute error of 8.7 on high-confidence predictions (Accuracy subset). Notably, even under zero-leakage conditions (i.e., for entirely unseen, unpublished benchmarks), GPT-5 augmented with web search retains nontrivial predictive capability. This work enables proactive evaluation and experiment prioritization, establishing a new pathway toward efficient, scalable LLM assessment.
📝 Abstract
Progress in large language models is constrained by an evaluation bottleneck: build a benchmark, evaluate models and settings, then iterate. We therefore ask a simple question: can we forecast outcomes before running any experiments? We study text-only performance forecasting: estimating a model's score from a redacted task description and intended configuration, with no access to dataset instances. To support systematic study, we curate PRECOG, a corpus of redacted description-performance pairs spanning diverse tasks, domains, and metrics. Experiments show the task is challenging but feasible: models equipped with a retrieval module that excludes source papers achieve moderate prediction performance with well-calibrated uncertainty, reaching mean absolute error as low as 8.7 on the Accuracy subset at high-confidence thresholds. Our analysis indicates that stronger reasoning models engage in diverse, iterative querying, whereas current open-source models lag and often skip retrieval or gather evidence with limited diversity. We further test a zero-leakage setting, forecasting on newly released datasets or experiments before their papers are indexed, where GPT-5 with built-in web search still attains nontrivial prediction accuracy. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter experiment prioritization.
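The confidence-thresholded evaluation described above (e.g., MAE as low as 8.7 at high-confidence thresholds) can be sketched as follows; the `Forecast` fields, threshold, and numbers are toy assumptions for illustration, not data from the corpus.

```python
from dataclasses import dataclass

@dataclass
class Forecast:
    predicted: float   # forecast benchmark score (e.g., accuracy in 0-100)
    actual: float      # score observed after actually running the experiment
    confidence: float  # predictor's self-reported confidence in [0, 1]

def mae_at_confidence(forecasts, threshold):
    """MAE over the subset of forecasts at or above a confidence threshold."""
    kept = [f for f in forecasts if f.confidence >= threshold]
    if not kept:
        return None, 0
    mae = sum(abs(f.predicted - f.actual) for f in kept) / len(kept)
    return mae, len(kept)

# Toy forecasts (not from the paper)
preds = [
    Forecast(72.0, 80.0, 0.4),
    Forecast(65.0, 68.0, 0.9),
    Forecast(55.0, 50.0, 0.8),
]
mae_all, n_all = mae_at_confidence(preds, 0.0)    # MAE over everything
mae_hi, n_hi = mae_at_confidence(preds, 0.75)     # high-confidence subset only
```

With well-calibrated uncertainty, raising the threshold trades coverage (`n_hi` < `n_all`) for lower error, which is the regime in which the reported 8.7 MAE is attained.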