AI Summary
This study addresses the paradoxical degradation of performance in more capable large language models (LLMs) when forecasting tail risks characterized by superlinear growth and regime shifts. By constructing an uncontaminated simulation benchmark, ForecastBench-Sim, alongside real-world datasets from pandemics and financial markets, the authors employ quantile decomposition, intra-model-family comparisons (e.g., Llama-3.1), and tail-sensitive evaluation metrics to uncover and validate a "capability-as-burden" phenomenon: stronger models systematically overestimate upper-tail risks, leading to worse distributional calibration. This effect proves robust across domains and persists despite the incorporation of domain-specific knowledge. The work advocates replacing conventional single-threshold metrics with continuous scoring mechanisms that explicitly account for tail behavior to better assess model reliability in high-stakes forecasting scenarios.
Abstract
We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release; in forecasts of synthetic SIR epidemics with a matched linear control; and in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows that the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that model scale and post-training each contribute independently to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on the single-threshold metrics common in LLM forecasting benchmarks: single-threshold scoring at conventional cutoffs misses the upper-tail cost, whereas tail-inclusive scoring reverses the sign of the capability-accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.
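The contrast between single-threshold and tail-inclusive scoring can be illustrated with a minimal sketch. It assumes pinball (quantile) loss as the continuous per-quantile measure and a Brier score on one exceedance threshold as the bounded binary measure; the abstract does not name the exact metrics used, so both are stand-ins. The two forecasts, the outcome, and the threshold are invented toy numbers, not results from the paper, and the function names are hypothetical.

```python
def pinball_loss(level, predicted, outcome):
    """Pinball (quantile) loss for one quantile forecast; lower is better."""
    diff = outcome - predicted
    return max(level * diff, (level - 1.0) * diff)


def per_quantile_losses(quantiles, outcome):
    """Decompose the score by quantile level to see where the error sits."""
    return {lvl: pinball_loss(lvl, q, outcome) for lvl, q in quantiles.items()}


def exceed_prob(quantiles, threshold):
    """Implied P(outcome > threshold), linearly interpolating the forecast CDF.

    Assumes quantile values are non-decreasing in the level; outside the
    forecast range the CDF is clipped to the lowest or highest level.
    """
    levels = sorted(quantiles)
    values = [quantiles[lvl] for lvl in levels]
    if threshold <= values[0]:
        cdf = levels[0]
    elif threshold >= values[-1]:
        cdf = levels[-1]
    else:
        for i in range(1, len(levels)):
            if threshold <= values[i]:
                frac = (threshold - values[i - 1]) / (values[i] - values[i - 1])
                cdf = levels[i - 1] + frac * (levels[i] - levels[i - 1])
                break
    return 1.0 - cdf


def threshold_brier(quantiles, threshold, outcome):
    """Brier score for the binary event 'outcome > threshold' (bounded in [0, 1])."""
    event = 1.0 if outcome > threshold else 0.0
    return (exceed_prob(quantiles, threshold) - event) ** 2


# Toy numbers (assumed for illustration): same lower tail, but the
# hypothetical "stronger" model pushes its upper tail far upward.
outcome, threshold = 100.0, 80.0
weaker = {0.1: 60.0, 0.5: 90.0, 0.9: 130.0}
stronger = {0.1: 60.0, 0.5: 110.0, 0.9: 400.0}

loss_weaker = per_quantile_losses(weaker, outcome)
loss_stronger = per_quantile_losses(stronger, outcome)
brier_weaker = threshold_brier(weaker, threshold, outcome)
brier_stronger = threshold_brier(stronger, threshold, outcome)

print("pinball by quantile:", loss_weaker, loss_stronger)
print("threshold Brier:", brier_weaker, brier_stronger)
```

In this toy setup the forecast with the inflated upper tail scores better on the threshold metric but worse on total pinball loss, and the extra loss sits entirely at the 0.9 quantile. That is the kind of sign reversal on identical outputs the abstract describes, though the magnitudes here are arbitrary.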