Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

📅 2026-05-21
🤖 AI Summary
This study addresses the paradoxical degradation of performance in more capable large language models (LLMs) when forecasting tail risks characterized by superlinear growth and regime shifts. By constructing an uncontaminated simulation benchmark, ForecastBench-Sim, alongside real-world datasets from pandemics and financial markets, the authors employ quantile decomposition, intra-model-family comparisons (e.g., Llama-3.1), and tail-sensitive evaluation metrics to uncover and validate a "capability-as-burden" phenomenon: stronger models systematically overestimate upper-tail risks, leading to worse distributional calibration. This effect proves robust across domains and persists despite the incorporation of domain-specific knowledge. The work advocates replacing conventional single-threshold metrics with continuous scoring mechanisms that explicitly account for tail behavior to better assess model reliability in high-stakes forecasting scenarios.
๐Ÿ“ Abstract
We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on single-threshold metrics common in LLM forecasting benchmarks, reversing the sign of the capability--accuracy relationship on identical outputs. Single-threshold scoring at conventional cutoffs misses the upper-tail cost; tail-inclusive scoring reverses the sign of the capability--accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.
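To make the contrast between single-threshold and tail-inclusive scoring concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's evaluation code: the quantile levels, the forecast values, the outcome, the threshold, and the function names are all hypothetical. The pinball (quantile) loss stands in for a tail-inclusive continuous score, and a Brier score on a single exceedance event stands in for a single-threshold metric. Two forecasts share the same lower and median quantiles and differ only in the upper quantile.

```python
import numpy as np


def pinball_loss(y, forecast, level):
    """Pinball (quantile) loss of one quantile forecast at the given level."""
    diff = y - forecast
    return max(level * diff, (level - 1.0) * diff)


def per_quantile_losses(y, forecasts, levels):
    """Per-quantile decomposition: one pinball loss per quantile level."""
    return {lvl: pinball_loss(y, f, lvl) for lvl, f in zip(levels, forecasts)}


def exceedance_brier(y, forecasts, levels, threshold):
    """Brier score for the single event `y > threshold`.

    The forecast CDF is approximated by linear interpolation between the
    quantile forecasts (an assumption made for this sketch).
    """
    cdf_at_threshold = np.interp(threshold, forecasts, levels)
    p_exceed = 1.0 - cdf_at_threshold
    outcome = 1.0 if y > threshold else 0.0
    return float((p_exceed - outcome) ** 2)


# Hypothetical numbers, chosen only for illustration.
levels = [0.1, 0.5, 0.9]
forecast_a = [80.0, 100.0, 130.0]   # moderate upper tail
forecast_b = [80.0, 100.0, 400.0]   # same lower tail, inflated upper tail
y_true = 110.0
threshold = 90.0                    # a cutoff far below the upper tail

losses_a = per_quantile_losses(y_true, forecast_a, levels)
losses_b = per_quantile_losses(y_true, forecast_b, levels)
brier_a = exceedance_brier(y_true, forecast_a, levels, threshold)
brier_b = exceedance_brier(y_true, forecast_b, levels, threshold)

print("pinball A:", losses_a, "Brier A:", brier_a)
print("pinball B:", losses_b, "Brier B:", brier_b)
```

In this toy setup the threshold score cannot tell the two forecasts apart, because the cutoff sits in the region where they agree, while the pinball loss at the 0.9 level grows with the inflated upper quantile. The sketch shows only how a single cutoff can miss the upper-tail cost; it does not reproduce the sign reversal reported in the abstract.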
Problem

Research questions and friction points this paper is trying to address.

inverse scaling
tail risk
distributional forecasting
superlinear growth
regime change
Innovation

Methods, ideas, or system contributions that make the work stand out.

inverse scaling
tail risk
distributional forecasting
ForecastBench-Sim
LLM evaluation