🤖 AI Summary
This study addresses a critical bottleneck: evaluating the capability of large language models (LLMs) to forecast real-world future events. The authors propose the "LLM-as-a-Prophet" paradigm and introduce Prophet Arena, a dynamic, decomposable, and continuously updated benchmark built from authentic forecasting tasks in finance, economics, and related domains. Leveraging real-time data ingestion, pipeline-based task decomposition, calibration-aware evaluation, and analysis of multi-source information fusion, they conduct large-scale controlled experiments. The results show that mainstream LLMs exhibit strong calibration and consistent prediction confidence, and yield positive returns in simulated trading; however, they also suffer from inaccurate event recall and delayed responsiveness to new information. Prophet Arena thus offers a systematic evaluation framework for predictive intelligence, pushing LLMs beyond comprehension-oriented capabilities toward genuine predictive competence.
📝 Abstract
Forecasting is not only a fundamental intellectual pursuit but also of significant importance to societal systems such as finance and economics. The rapid advances of large language models (LLMs) trained on Internet-scale data raise the promise of employing LLMs to forecast real-world future events, an emerging paradigm we call "LLM-as-a-Prophet". This paper systematically investigates such predictive intelligence of LLMs. To this end, we build Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, in order to support controlled and large-scale experimentation. Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence, and promising market returns. However, we also uncover key bottlenecks to achieving superior predictive intelligence via LLM-as-a-Prophet, such as LLMs' inaccurate event recall, misunderstanding of data sources, and slower information aggregation compared to markets as resolution nears.
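To make the "calibration-aware evaluation" concrete, the sketch below computes two standard metrics for probabilistic forecasts: the Brier score (mean squared error between stated probabilities and 0/1 outcomes) and a binned expected calibration error. The data is made up for illustration, and these are generic textbook metrics, not necessarily the exact scoring rules Prophet Arena uses.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes.
    Lower is better; a perfect forecaster scores 0."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin forecasts by stated confidence, then average the gap between
    mean confidence and empirical frequency, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, o))
    n, ece = len(probs), 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # average stated probability
        freq = sum(o for _, o in b) / len(b)   # fraction of events that occurred
        ece += (len(b) / n) * abs(conf - freq)
    return ece

# Hypothetical forecasts for five resolved events:
probs = [0.9, 0.8, 0.7, 0.3, 0.2]
outcomes = [1, 1, 0, 0, 0]
print(round(brier_score(probs, outcomes), 3))                # → 0.134
print(round(expected_calibration_error(probs, outcomes), 3))
```

A well-calibrated model's 70%-confidence forecasts should come true roughly 70% of the time; both metrics reward that property, which is why simple accuracy alone is insufficient for judging a forecaster.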