QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals

📅 2026-04-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of large language model (LLM) forecasting rely mostly on simple formats such as binary or multiple-choice questions, which offer limited insight into how well models express uncertainty over continuous numerical quantities. This work addresses the gap by using prediction intervals as the evaluation format and introducing QuantSightBench, a benchmark that assesses models' scale awareness, consistency across confidence levels, and calibration across diverse domains using two metrics: empirical coverage and interval sharpness. None of the 11 evaluated models achieves the target 90% coverage rate; the best-performing model, Gemini 3.1 Pro, attains only 79.1%. Calibration also degrades sharply at extreme magnitudes, exposing systematic overconfidence across all evaluated models.

📝 Abstract

Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates of continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark, QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
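To make the two evaluation quantities concrete, the sketch below computes empirical coverage (the fraction of true values that fall inside the stated interval) and one possible sharpness proxy on a handful of made-up forecasts. The abstract does not specify how the benchmark measures sharpness, so the relative-width formula here is an assumption, as are the names `IntervalForecast`, `empirical_coverage`, `mean_relative_width`, and `is_nested`. The last helper illustrates the internal-consistency requirement: an interval at a lower confidence level should sit inside the one at a higher level.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class IntervalForecast:
    """Hypothetical record: one stated interval plus the resolved true value."""
    lower: float
    upper: float
    actual: float


def empirical_coverage(forecasts):
    """Fraction of forecasts whose interval contains the true value."""
    return mean(1.0 if f.lower <= f.actual <= f.upper else 0.0 for f in forecasts)


def mean_relative_width(forecasts):
    """Assumed sharpness proxy: interval width divided by |actual|.

    Smaller is sharper. Normalizing by magnitude keeps quantities on very
    different scales comparable; the benchmark's own metric may differ.
    """
    return mean((f.upper - f.lower) / abs(f.actual) for f in forecasts if f.actual != 0)


def is_nested(inner, outer):
    """True if the (lower, upper) pair `inner` lies within `outer`."""
    return outer[0] <= inner[0] <= inner[1] <= outer[1]


# Toy data, invented purely for illustration (not benchmark results).
forecasts = [
    IntervalForecast(lower=90, upper=110, actual=100),     # hit
    IntervalForecast(lower=40, upper=60, actual=75),       # miss: interval too narrow
    IntervalForecast(lower=1000, upper=3000, actual=2000), # hit, but wide
    IntervalForecast(lower=5, upper=6, actual=5.5),        # hit
]

coverage = empirical_coverage(forecasts)
sharpness = mean_relative_width(forecasts)
print(f"coverage={coverage:.2f} vs. 0.90 target, mean relative width={sharpness:.2f}")

# A 50% interval should be contained in the 90% interval for the same question.
print(is_nested(inner=(95, 105), outer=(90, 110)))
```

A model whose nominal 90% intervals cover the truth well under 90% of the time is overconfident in the sense the abstract describes: its intervals are too narrow. Coverage and sharpness have to be read together, since widening every interval raises coverage while making the forecasts less informative.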

Problem

Research questions and friction points this paper is trying to address.

forecasting

large language models

prediction intervals

quantitative estimation

uncertainty evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

prediction intervals

quantitative forecasting

large language models