🤖 AI Summary
This work addresses the absence of a unified closed-loop learning environment in which agents continually learn from real-world events and forecast future outcomes. To bridge this gap, we propose FutureWorld, the first framework that formulates live future prediction as a reinforcement learning environment. By closing the loop between prediction, outcome realization, and parameter update, FutureWorld prevents answer leakage and supports continual learning. Built on open-source large language models and grounded in real-world event feedback, the framework also provides a daily-updated benchmark for training and evaluation. Experiments over consecutive days show that training in the environment is effective, and evaluating several frontier agents on the benchmark establishes performance baselines for current agent systems.
📝 Abstract
Live future prediction is the task of making predictions about real-world events before they unfold. The task is increasingly studied with agent systems based on large language models, and it is important for building agents that can continually learn from the real world. Just as interactive environments have often driven progress in agents, advancing live future prediction naturally motivates treating it as a learning environment. Prior work has explored individual aspects of future prediction but has generally not framed it as a unified learning environment. The task is appealing for learning because it supplies a large number of prediction questions grounded in diverse real-world events while preventing answer leakage, since outcomes are not yet known when predictions are made. To exploit these advantages, we present FutureWorld, a live agentic reinforcement learning environment that closes the training loop between prediction, outcome realization, and parameter update. In this environment, we train three open-source base models over consecutive days, and the results show that training is effective. Furthermore, we build a daily benchmark on top of the environment and evaluate several frontier agents on it to establish performance baselines for current agent systems.
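To make the closed loop of prediction, outcome realization, and parameter update concrete, the following toy sketch may help. It is not FutureWorld's implementation: the `Question` record, the single-parameter `ToyForecaster` standing in for an LLM agent, the negative Brier score used as reward, and the function `run_closed_loop` are all assumptions introduced purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    open_day: int      # day on which the question is posed
    resolve_day: int   # day on which the real-world outcome becomes known
    outcome: int       # 1 or 0; never read by the loop before resolve_day


class ToyForecaster:
    """Hypothetical stand-in for an LLM agent: one logit shared by all questions."""

    def __init__(self, lr: float = 0.5):
        self.logit = 0.0
        self.lr = lr

    def predict(self, question: Question) -> float:
        # Probability that the event happens (sigmoid of the single parameter).
        return 1.0 / (1.0 + math.exp(-self.logit))

    def update(self, prob: float, outcome: int) -> None:
        # Gradient ascent on reward = -(prob - outcome)^2 with respect to the
        # logit, evaluated at the probability that was actually submitted.
        grad = -2.0 * (prob - outcome) * prob * (1.0 - prob)
        self.logit += self.lr * grad


def run_closed_loop(questions, num_days, forecaster):
    """Return the mean reward per day (None on days with no resolved question)."""
    pending = []        # (question, submitted probability) awaiting resolution
    daily_rewards = []
    for day in range(num_days):
        # 1. Prediction: answer today's questions before their outcomes exist.
        for q in questions:
            if q.open_day == day:
                pending.append((q, forecaster.predict(q)))

        # 2. Outcome realization: collect questions whose outcome is now known.
        resolved = [(q, p) for q, p in pending if q.resolve_day == day]
        pending = [(q, p) for q, p in pending if q.resolve_day != day]

        # 3. Parameter update: score each resolved prediction and learn from it.
        rewards = []
        for q, p in resolved:
            rewards.append(-(p - q.outcome) ** 2)   # assumed reward: negative Brier score
            forecaster.update(p, q.outcome)
        daily_rewards.append(sum(rewards) / len(rewards) if rewards else None)
    return daily_rewards
```

In this sketch the reward for a prediction only arrives on the day the event resolves, so the outcome cannot leak into the prediction that is being scored. A real environment would replace the toy forecaster with an LLM agent and the hard-coded `outcome` field with feedback gathered from real-world events.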