🤖 AI Summary
This study addresses a critical bottleneck: evaluating the capability of large language models (LLMs) to forecast real-world future events. The authors propose the "LLM-as-a-Prophet" paradigm and introduce Prophet Arena, a dynamic, decomposable, and continuously updated benchmark built from authentic forecasting tasks in finance, economics, and related domains. Leveraging real-time data ingestion, pipeline-based task decomposition, calibration-aware evaluation, and analysis of multi-source information fusion, they conduct large-scale controlled experiments. The results show that mainstream LLMs exhibit strong calibration and consistent prediction confidence, and yield positive returns in simulated trading; however, they also suffer from inaccurate event recall and delayed responsiveness to new information. Prophet Arena thus offers a systematic evaluation framework for predictive intelligence, pushing LLMs beyond comprehension-oriented capabilities toward genuine predictive competence.
📝 Abstract
Forecasting is not only a fundamental intellectual pursuit but also of significant importance to societal systems such as finance and economics. The rapid advances of large language models (LLMs) trained on Internet-scale data raise the promise of employing LLMs to forecast real-world future events, an emerging paradigm we call "LLM-as-a-Prophet". This paper systematically investigates such predictive intelligence of LLMs. To this end, we build Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, in order to support controlled and large-scale experimentation. Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence, and promising market returns. However, we also uncover key bottlenecks to achieving superior predictive intelligence via LLM-as-a-Prophet, such as LLMs' inaccurate event recall, misunderstanding of data sources, and slower information aggregation compared to markets as resolution nears.
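To make the "calibration-aware evaluation" concrete, the sketch below computes two standard metrics for probabilistic forecasts: the Brier score (mean squared error between stated probabilities and 0/1 outcomes) and a binned expected calibration error. The data is made up for illustration, and these are generic textbook metrics, not necessarily the exact scoring rules Prophet Arena uses.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes.
    Lower is better; a perfect forecaster scores 0."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin forecasts by stated confidence, then average the gap between
    mean confidence and empirical frequency, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, o))
    n, ece = len(probs), 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # average stated probability
        freq = sum(o for _, o in b) / len(b)   # fraction of events that occurred
        ece += (len(b) / n) * abs(conf - freq)
    return ece

# Hypothetical forecasts for five resolved events:
probs = [0.9, 0.8, 0.7, 0.3, 0.2]
outcomes = [1, 1, 0, 0, 0]
print(round(brier_score(probs, outcomes), 3))                # → 0.134
print(round(expected_calibration_error(probs, outcomes), 3))
```

A well-calibrated model's 70%-confidence forecasts should come true roughly 70% of the time; both metrics reward that property, which is why simple accuracy alone is insufficient for judging a forecaster.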