AI Summary
Current large language models struggle to effectively leverage consensus and disagreement in human crowd predictions for probabilistic forecasting, often resulting in poorly calibrated outputs. This work proposes the Beta-Bernoulli Calibrator (BBC), which, for the first time, jointly models the mean and uncertainty of human predictions as parameters of a Beta distribution. BBC employs a lightweight, post-hoc calibration framework that requires no model fine-tuning and transforms point predictions from any base model into well-calibrated probability estimates accompanied by epistemic uncertainty quantification. Built upon a hierarchical Bayesian model, BBC learns from binary outcomes and crowd predictions through supervised training. Empirical evaluations across multiple benchmarks demonstrate that BBC significantly outperforms conventional calibration methods and specialized fine-tuned models, with its epistemic uncertainty providing a more reliable predictor of actual prediction errors than the model's self-reported confidence.
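The summary does not say how crowd forecasts are mapped to Beta parameters, so the following is only an illustrative sketch of one standard relationship, moment matching, and not the paper's training procedure. The function name `beta_from_moments` and the sample forecasts are hypothetical. It shows how a crowd mean and a crowd variance together pin down a Beta distribution: tighter agreement among forecasters yields a larger concentration $\alpha + \beta$.

```python
from statistics import fmean, pvariance
from typing import Tuple


def beta_from_moments(mean: float, variance: float) -> Tuple[float, float]:
    """Return (alpha, beta) of the Beta distribution with the given mean and variance.

    Hypothetical helper for illustration; uses the standard identities
    mean = a / (a + b) and variance = mean * (1 - mean) / (a + b + 1).
    """
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    max_variance = mean * (1.0 - mean)
    if not 0.0 < variance < max_variance:
        raise ValueError("variance must lie strictly between 0 and mean * (1 - mean)")
    concentration = max_variance / variance - 1.0  # this is alpha + beta
    return mean * concentration, (1.0 - mean) * concentration


# Made-up crowd forecasts for a single question (assumption, not data from the paper).
crowd = [0.6, 0.7, 0.8]
crowd_mean = fmean(crowd)
crowd_var = pvariance(crowd)
alpha, beta = beta_from_moments(crowd_mean, crowd_var)
```

If the same crowd mean came with a wider spread of individual forecasts, the recovered $\alpha + \beta$ would shrink, which is the sense in which a Beta distribution can encode both consensus and disagreement.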
Abstract
Probabilistic forecasting estimates the likelihood of uncertain future events. To improve LLM forecasting, existing methods typically learn from binary outcomes to output verbalized forecasts. However, while aggregated human forecasts contain rich information in both the crowd probability estimate and the degree of agreement among forecasters, how to utilize these signals remains underexplored. To address this, we propose the Beta-Bernoulli Calibrator (BBC), which converts an initial point estimate forecast from any model into a distribution over event likelihood, using supervision from both binary outcomes and human forecasts. BBC models event likelihood $p \sim \text{Beta}(\alpha, \beta)$ and outcome $y \sim \text{Bernoulli}(p)$, with the mean as the calibrated point forecast and the variance as the epistemic uncertainty. Our results show that BBC generally provides better calibrated and more accurate forecasts than both traditional post-hoc calibration methods and models fine-tuned specifically for forecasting, while remaining lightweight and having good generalization. We also show that the epistemic uncertainty captured by BBC is a more reliable predictor of forecasting error than verbalized confidence.
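To make the two readouts concrete, here is a minimal sketch of how a calibrated point forecast and an epistemic uncertainty follow from a given pair $(\alpha, \beta)$, using the textbook mean and variance of a Beta distribution. The class name `BetaForecast` and the parameter values are assumptions chosen for illustration; how BBC actually predicts $\alpha$ and $\beta$ from a base model's forecast is not described in this abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaForecast:
    """Hypothetical container for a Beta(alpha, beta) distribution over event likelihood p."""

    alpha: float
    beta: float

    def __post_init__(self) -> None:
        if self.alpha <= 0 or self.beta <= 0:
            raise ValueError("alpha and beta must be positive")

    @property
    def mean(self) -> float:
        # E[p] = alpha / (alpha + beta): read as the calibrated point forecast.
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        # Var[p] = alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1)):
        # read as the epistemic uncertainty about p.
        total = self.alpha + self.beta
        return self.alpha * self.beta / (total ** 2 * (total + 1.0))


# Two made-up forecasts with the same mean but different concentration.
concentrated = BetaForecast(alpha=70.0, beta=30.0)
diffuse = BetaForecast(alpha=7.0, beta=3.0)

print(f"concentrated: mean={concentrated.mean:.3f}, variance={concentrated.variance:.5f}")
print(f"diffuse:      mean={diffuse.mean:.3f}, variance={diffuse.variance:.5f}")
```

Both forecasts give the same probability for the event, yet the diffuse one has a variance roughly nine times larger. That gap is the extra signal a distributional forecast carries beyond a single number.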