ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities

📅 2024-09-30
🏛️ arXiv.org
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Existing forecasting evaluation frameworks lack dynamic, future-oriented benchmarks that isolate models from temporal leakage and adapt to evolving real-world events. Method: ForecastBench is a dynamic benchmark of AI forecasting capability that automatically generates and continuously updates a set of 1,000 questions about future events. Because every question concerns an event whose answer is unknown at submission time, data leakage is ruled out by construction. Forecasts are collected from expert human forecasters, the general public, and large language models (LLMs). Contribution/Results: On a 200-question subset, expert forecasters outperform the top-performing LLM (p < 0.01), highlighting the limitations of current LLMs on genuinely forward-looking prediction tasks. ForecastBench provides a scalable, reproducible, and openly accessible evaluation platform with a public leaderboard at forecastbench.org.

📝 Abstract
Forecasts of future events are essential inputs into informed decision-making. Machine learning (ML) systems have the potential to deliver forecasts at scale, but there is no framework for evaluating the accuracy of ML systems on a standardized set of forecasting questions. To address this gap, we introduce ForecastBench: a dynamic benchmark that evaluates the accuracy of ML systems on an automatically generated and regularly updated set of 1,000 forecasting questions. To avoid any possibility of data leakage, ForecastBench comprises solely questions about future events that have no known answer at the time of submission. We quantify the capabilities of current ML systems by collecting forecasts from expert (human) forecasters, the general public, and LLMs on a random subset of questions from the benchmark ($N=200$). While LLMs have achieved super-human performance on many benchmarks, they perform less well here: expert forecasters outperform the top-performing LLM (p-value $<0.01$). We display system and human scores in a public leaderboard at www.forecastbench.org.
Problem

Research questions and friction points this paper addresses:

- Artificial Intelligence
- Human Prediction
- Benchmark Platform

Innovation

Methods, ideas, or system contributions that make the work stand out:

- ForecastBench
- AI Prediction Capabilities
- Expert vs AI Forecasting
Authors

Ezra Karger
Federal Reserve Bank of Chicago (labor economics, public economics)

Houtan Bastani
Forecasting Research Institute

Chen Yueh-Han
New York University

Zachary Jacobs
Forecasting Research Institute

Danny Halawi
University of California, Berkeley

Fred Zhang
Google DeepMind (Machine Learning)

P. Tetlock
Forecasting Research Institute, University of Pennsylvania