LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the critical bottleneck of evaluating large language models' (LLMs) capability to forecast real-world future events. We propose the "LLM-as-a-Prophet" paradigm and introduce Prophet Arena, a dynamic, decomposable, and continuously updated benchmark of authentic forecasting tasks in finance, economics, and related domains. Leveraging real-time data ingestion, pipeline-based task decomposition, calibration-aware evaluation, and multi-source information-fusion analysis, we conduct large-scale controlled experiments. The results show that mainstream LLMs exhibit strong calibration and confidence consistency and earn positive returns in simulated trading; however, they suffer from event-memory bias and delayed responsiveness to new information. Prophet Arena constitutes the first systematic evaluation framework for predictive intelligence, advancing LLMs beyond comprehension-oriented capabilities toward genuine predictive competence.

📝 Abstract
Forecasting is not only a fundamental intellectual pursuit but also of significant importance to societal systems such as finance and economics. The rapid advances of large language models (LLMs) trained on Internet-scale data raise the promise of employing LLMs to forecast real-world future events, an emerging paradigm we call "LLM-as-a-Prophet". This paper systematically investigates such predictive intelligence of LLMs. To this end, we build Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, in order to support our controlled and large-scale experimentation. Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence and promising market returns. However, we also uncover key bottlenecks towards achieving superior predictive intelligence via LLM-as-a-Prophet, such as LLMs' inaccurate event recall, misunderstanding of data sources, and slower information aggregation compared to markets as resolution nears.
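The "small calibration errors" cited above can be made concrete with a binned expected calibration error (ECE): group forecasts by confidence and compare each bin's average confidence with the empirical outcome frequency. The sketch below is illustrative only; the function name and binning choices are assumptions, not Prophet Arena's actual evaluation code.

```python
# Minimal sketch of binned expected calibration error (ECE) for binary
# forecasts. Illustrative assumption: equal-width probability bins,
# population-weighted |accuracy - confidence| gap.
from typing import Sequence

def expected_calibration_error(
    probs: Sequence[float],    # predicted probabilities that the event resolves YES
    outcomes: Sequence[int],   # 1 if the event actually occurred, else 0
    n_bins: int = 10,
) -> float:
    """Weighted average of |empirical accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(p for p, _ in bucket) / len(bucket)
        avg_acc = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_acc - avg_conf)
    return ece

# A reasonably calibrated forecaster: 90% confidence on events that happen,
# 10% on events that don't, giving an ECE close to 0.1.
score = expected_calibration_error([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
```

A perfectly calibrated forecaster (confidence exactly matching outcome frequency in every bin) would score 0.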
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' predictive capabilities for real-world events
Identifying bottlenecks in LLM-based forecasting systems
Developing a benchmark for controlled forecasting experiments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prophet Arena benchmark for controlled forecasting evaluation
LLMs demonstrate small calibration errors and consistent confidence
Identified bottlenecks in event recall and data understanding
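The "promising market returns" finding can be sketched as a simple simulated-trading rule: back a binary prediction-market contract whenever the model's probability disagrees with the market price. The function name and the one-unit trading rule below are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch of a simulated-trading return check for binary prediction
# markets. Assumption: buy one YES share at `price` when the model is more
# confident than the market, otherwise one NO share at (1 - price); each
# share pays out 1 if its side resolves true.
from typing import Sequence

def simulated_return(
    model_probs: Sequence[float],   # model's probability of YES
    market_prices: Sequence[float], # market price of the YES contract, in (0, 1)
    outcomes: Sequence[int],        # 1 if the event resolved YES, else 0
) -> float:
    """Net payoff across all contracts under the one-unit rule above."""
    pnl = 0.0
    for p, price, y in zip(model_probs, market_prices, outcomes):
        if p > price:
            # Buy YES: pay `price`, receive 1 if the event occurs.
            pnl += (1.0 if y else 0.0) - price
        else:
            # Buy NO: pay (1 - price), receive 1 if the event does not occur.
            pnl += (0.0 if y else 1.0) - (1.0 - price)
    return pnl

# The model beats the market on both contracts here, so the return is positive.
pnl = simulated_return([0.8, 0.2], [0.6, 0.5], [1, 0])
```

A positive return under such a rule is evidence that the model's probabilities carry information the market price does not, which is the sense in which the summary reports "positive returns in simulated trading".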
👥 Authors
Qingchuan Yang (University of Southern California)
Simon Mahns (Meta)
Sida Li (Undergraduate, Peking University) · Multimodal LLM, Stable Diffusion
Anri Gu (The University of Chicago)
Jibang Wu (The University of Chicago) · Algorithmic Game Theory, Machine Learning, Recommendation Systems
Haifeng Xu (The University of Chicago)