🤖 AI Summary
Existing evaluations do not adequately measure how well AI agents adapt to new information in dynamic, open-world environments. This work proposes FutureSim, a benchmark built on chronological replay of real-world events: agents receive real news articles in temporal order and must forecast events that occur after their knowledge cutoff date, which tests test-time adaptation, search, memory, and reasoning about uncertainty. Agents are evaluated in their native harness on an event-forecasting task, scored by accuracy and Brier skill score and complemented by ablation studies. Over a simulated period from January to March 2026, the best-performing agent reaches only 25% accuracy, and many agents achieve a worse Brier skill score than making no prediction at all, highlighting a pronounced limitation in current agents’ long-horizon adaptability.
📝 Abstract
AI agents are increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arrive and questions resolve over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent reaching 25% accuracy and many achieving a worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time horizons in the real world.
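To make the "worse than making no prediction at all" comparison concrete, the following is a minimal sketch of a Brier skill score in its standard binary form, BSS = 1 - BS / BS_ref. It is not taken from the paper: the abstract does not state FutureSim's exact scoring rule, so the binary-outcome setup, the function names, and the choice of a constant 0.5 forecast as the "no prediction" reference are all assumptions made for illustration.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    probs: Sequence[float],
    outcomes: Sequence[int],
    reference_prob: float = 0.5,  # assumed stand-in for "making no prediction"
) -> float:
    """Standard skill score: 1 - BS / BS_ref.

    Positive values beat the reference forecast, zero matches it,
    and negative values are worse than the reference.
    """
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    if bs_ref == 0:
        raise ValueError("reference forecast is perfect; skill score is undefined")
    return 1.0 - bs / bs_ref


# Hypothetical resolved questions (1 = event happened, 0 = it did not).
outcomes = [1, 0, 1, 0]
calibrated = [0.9, 0.1, 0.8, 0.2]     # confident and right
overconfident = [0.1, 0.9, 0.2, 0.8]  # confident and wrong

print(brier_skill_score(calibrated, outcomes))
print(brier_skill_score(overconfident, outcomes))
```

Under these assumptions, an agent that is confidently wrong receives a negative skill score, which is the sense in which a forecaster can do worse than abstaining: an uninformative forecast scores exactly zero, while overconfident errors are penalized quadratically.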