🤖 AI Summary
This study investigates whether large language models (LLMs) can dynamically update their predictions and recalibrate their confidence when given information released after training, using human forecasters as a benchmark. To this end, we introduce EVOLVECAST, the first systematic framework for evaluating belief evolution, built on a pre-/post-prediction comparison paradigm triggered by novel evidence. It quantifies the consistency, magnitude, and confidence shifts of model beliefs, and compares confidence estimates derived from verbalized text outputs against those derived from logits. Experiments show that although LLMs do respond to new evidence, their belief updates are often inconsistent and markedly conservative, and their confidence calibration falls well short of the human reference, exhibiting a systematic conservative bias. This work provides the first empirical characterization of these limitations of LLMs in dynamic forecasting, establishing a novel benchmark and an evidence-based foundation for designing and evaluating trustworthy predictive systems.
📝 Abstract
Prior work has largely treated future event prediction as a static task, failing to consider how forecasts, and the confidence placed in them, should evolve as new evidence emerges. To address this gap, we introduce EVOLVECAST, a framework for evaluating whether large language models appropriately revise their predictions in response to new information. In particular, EVOLVECAST assesses whether LLMs adjust their forecasts when presented with information released after their training cutoff. We use human forecasters as a comparative reference to analyze prediction shifts and confidence calibration under updated contexts. While LLMs demonstrate some responsiveness to new information, their updates are often inconsistent or overly conservative. We further find that neither verbalized nor logits-based confidence estimates consistently outperform the other, and both remain far from the human reference standard. Across settings, models tend to exhibit a conservative bias, underscoring the need for more robust approaches to belief updating.
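To make the contrast between the two confidence readouts concrete, here is a minimal sketch of how they are typically obtained for a binary forecast. This is an illustrative assumption, not the paper's implementation: the function names, the two-token softmax, and the regex parsing heuristic are all hypothetical.

```python
import math
import re

def logit_confidence(yes_logit: float, no_logit: float) -> float:
    """Logits-based confidence: softmax probability of the 'Yes' answer
    token relative to the 'No' token (hypothetical two-token readout)."""
    m = max(yes_logit, no_logit)  # subtract max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

def verbalized_confidence(text: str):
    """Verbalized confidence: parse a stated percentage (e.g. '70%')
    from the model's free-text answer; returns None if absent."""
    match = re.search(r"(\d{1,3})\s*%", text)
    if match:
        return min(int(match.group(1)), 100) / 100.0
    return None

# Pre-/post-update comparison: how much the logits-based forecast
# shifts after new evidence is added to the context.
pre = logit_confidence(yes_logit=1.2, no_logit=0.8)
post = logit_confidence(yes_logit=2.5, no_logit=0.3)
print(f"logits-based shift: {post - pre:+.3f}")
print(verbalized_confidence("I am 70% confident the event occurs."))
```

A belief update is then characterized by the direction and magnitude of such shifts, and calibration by how well either confidence score tracks realized outcomes against the human reference.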