Unlocking the Forecasting Economy: A Suite of Datasets for the Full Lifecycle of Prediction Market: [Experiments & Analysis]

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the absence of a unified, continuously maintained dataset for decentralized prediction markets, whose lifecycle data is fragmented across heterogeneous on-chain and off-chain sources. We present the first relational data system encompassing the complete market lifecycle—from creation and trading to settlement—by integrating three layers of information: market metadata, individual trades, and oracle events. Through identifier resolution, on-chain data reconstruction, and incremental synchronization mechanisms, our system harmonizes disparate data streams. The released dataset spans from October 2020 to March 2026, comprising over 770,000 markets, 943 million trades, and nearly 2 million oracle events. It enables applications such as NBA outcome calibration and CPI expectation reconstruction, establishing the first reproducible and scalable infrastructure for full-cycle prediction market research.
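As an illustration of what an outcome-calibration analysis on such data might look like, the sketch below bins market prices and compares each bin's mean price with the observed frequency of the event. It is a generic reliability-table computation, not the method used in the paper; the function names `calibration_table` and `brier_score` and the `(price, outcome)` input format are assumptions made for this example.

```python
from collections import defaultdict


def calibration_table(samples, n_bins=10):
    """Group (price, outcome) pairs into equal-width price bins.

    samples: iterable of (price, outcome), price in [0, 1], outcome in {0, 1}.
    Returns a list of (bin_index, count, mean_price, hit_rate) tuples,
    one per non-empty bin, ordered by bin index.
    """
    bins = defaultdict(list)
    for price, outcome in samples:
        # A price of exactly 1.0 is placed in the last bin.
        idx = min(int(price * n_bins), n_bins - 1)
        bins[idx].append((price, outcome))

    table = []
    for idx in sorted(bins):
        rows = bins[idx]
        mean_price = sum(p for p, _ in rows) / len(rows)
        hit_rate = sum(o for _, o in rows) / len(rows)
        table.append((idx, len(rows), mean_price, hit_rate))
    return table


def brier_score(samples):
    """Mean squared difference between price and realized outcome."""
    samples = list(samples)
    return sum((p - o) ** 2 for p, o in samples) / len(samples)


# Hypothetical data: last traded price before an event and whether it occurred.
demo = [(0.25, 0), (0.25, 1), (0.75, 1), (0.75, 1), (1.0, 1)]
for row in calibration_table(demo, n_bins=4):
    print(row)
print("Brier score:", brier_score(demo))
```

In a well-calibrated market, each bin's hit rate should sit close to its mean price; large gaps in either direction indicate systematic over- or under-pricing.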

📝 Abstract

Prediction markets are markets for trading claims on future events, such as presidential elections, and their prices provide continuously updated signals of collective beliefs. In decentralized platforms such as Polymarket, the market lifecycle spans market creation, token registration, trading, oracle interaction, dispute, and final settlement, yet the corresponding data are fragmented across heterogeneous off-chain and on-chain sources. We present the first continuously maintained dataset suite for the full lifecycle of decentralized prediction markets, built on Polymarket. To address the challenges of large-scale cross-source integration, incomplete linkage, and continuous synchronization, we build a unified relational data system that integrates three canonical layers: market metadata, fill-level trading records, and oracle-resolution events, through identifier resolution, on-chain recovery, and incremental updates. The resulting dataset spans October 2020 to March 2026 and comprises more than 770 thousand market records, over 943 million fill records, and nearly 2 million oracle events. We describe the data model, collection pipeline, and consistency mechanisms that make the dataset reproducible and extensible, and we demonstrate its utility through descriptive analyses of market activity and two downstream case studies: NBA outcome calibration and CPI expectation reconstruction.
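To make the three-layer design more concrete, here is a minimal sketch of how market metadata, fill records, and oracle events could be linked relationally and updated incrementally. The table names, column names, and key formats are hypothetical and chosen for illustration; the abstract does not specify the released schema. The sketch uses Python's built-in `sqlite3` module and treats a transaction hash plus log index as the natural key for a fill, which is an assumption.

```python
import sqlite3

# Hypothetical schema: one table per layer, linked by market and token identifiers.
SCHEMA = """
CREATE TABLE markets (
    market_id    TEXT PRIMARY KEY,
    question     TEXT NOT NULL,
    token_id     TEXT UNIQUE NOT NULL
);
CREATE TABLE fills (
    fill_id      TEXT PRIMARY KEY,          -- assumed form: "<tx_hash>:<log_index>"
    token_id     TEXT NOT NULL REFERENCES markets(token_id),
    block_number INTEGER NOT NULL,
    price        REAL NOT NULL,
    size         REAL NOT NULL
);
CREATE TABLE oracle_events (
    event_id     INTEGER PRIMARY KEY,
    market_id    TEXT NOT NULL REFERENCES markets(market_id),
    block_number INTEGER NOT NULL,
    event_type   TEXT NOT NULL
);
"""


def add_fills(conn, rows):
    """Insert fills idempotently; rows already present are skipped by primary key."""
    before = conn.total_changes
    conn.executemany("INSERT OR IGNORE INTO fills VALUES (?, ?, ?, ?, ?)", rows)
    return conn.total_changes - before


def market_summary(conn):
    """Per market: number of fills and the most recent oracle event type."""
    return conn.execute("""
        SELECT m.market_id,
               (SELECT COUNT(*) FROM fills f
                 WHERE f.token_id = m.token_id) AS n_fills,
               (SELECT e.event_type FROM oracle_events e
                 WHERE e.market_id = m.market_id
                 ORDER BY e.block_number DESC LIMIT 1) AS last_event
        FROM markets m
        ORDER BY m.market_id
    """).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO markets VALUES (?, ?, ?)", [
    ("m1", "Will team A win?", "t1"),
    ("m2", "Will CPI exceed 3%?", "t2"),
])
batch = [
    ("0xaa:0", "t1", 100, 0.60, 10.0),
    ("0xaa:1", "t1", 101, 0.62, 5.0),
    ("0xbb:0", "t2", 102, 0.30, 20.0),
]
add_fills(conn, batch)
conn.executemany(
    "INSERT INTO oracle_events (market_id, block_number, event_type) VALUES (?, ?, ?)",
    [("m1", 200, "proposed"), ("m1", 210, "settled")],
)
print(market_summary(conn))
```

Keying fills on a stable identifier means a sync job can safely re-send an overlapping window of records: calling `add_fills(conn, batch)` a second time inserts nothing. This is one simple way to get incremental updates without duplicates, not necessarily the mechanism the authors use.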

Problem

Research questions and friction points this paper is trying to address.

prediction markets

decentralized platforms

data integration

market lifecycle

oracle events

Innovation

Methods, ideas, or system contributions that make the work stand out.

prediction markets

decentralized data integration

oracle resolution