Benchmarking World-Model Learning

📅 2025-10-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current world-model agents over-rely on next-frame prediction and reward maximization, diverging from the original goal of supporting many downstream tasks and inferences. To address this, the authors propose WorldTest, an evaluation protocol that pairs reward-free exploration with cross-environment testing, decoupling dynamics learning from task execution and establishing a model-agnostic, open-ended benchmarking framework for world models. Its core components are a reward-free exploration phase, task transfer to a different but related environment, and behavior-based scoring. WorldTest is instantiated as AutumnBench, a suite of 43 grid-world environments and 129 tasks spanning three task families: masked-frame prediction, planning, and predicting changes to the causal dynamics. Experiments comparing 517 human participants with three frontier models show that humans substantially outperform the models, and that scaling compute yields gains in only some environments, exposing a fundamental bottleneck in current world models' capacity to represent environment dynamics.

📝 Abstract
Model-learning agents should gather information to learn world models that support many downstream tasks and inferences, such as predicting unobserved states, estimating near- and far-term consequences of actions, planning action sequences, and detecting changes in dynamics. Current methods for learning and evaluating world models diverge from this goal: training and evaluation are anchored to next-frame prediction, and success is scored by reward maximization in the same environment. We propose WorldTest, a protocol to evaluate model-learning agents that separates reward-free interaction from a scored test phase in a different but related environment. WorldTest is open-ended (models should support many different tasks unknown ahead of time) and agnostic to model representation, allowing comparison across approaches. We instantiated WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families: masked-frame prediction, planning, and predicting changes to the causal dynamics. We compared 517 human participants and three frontier models on AutumnBench. We found that humans outperform the models, and scaling compute improves performance only in some environments but not others. WorldTest provides a novel template (reward-free exploration, derived tests, and behavior-based scoring) to evaluate what agents learn about environment dynamics, and AutumnBench exposes significant headroom in world-model learning.
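The two-phase structure described in the abstract can be sketched in code. This is a minimal, hypothetical illustration of the protocol's shape; all class and method names below are assumptions for exposition and are not the paper's actual API.

```python
# Hypothetical sketch of the WorldTest two-phase evaluation protocol:
# reward-free exploration in one environment, then behavior-based scoring
# on derived tasks in a different but related environment.

class WorldTestProtocol:
    def __init__(self, explore_env, test_env, tasks):
        self.explore_env = explore_env  # environment for reward-free interaction
        self.test_env = test_env        # different but related test environment
        self.tasks = tasks              # derived tests, unknown to the agent ahead of time

    def run(self, agent, explore_steps=1000):
        # Phase 1: reward-free exploration. The agent interacts freely and
        # receives observations, but no reward signal of any kind.
        obs = self.explore_env.reset()
        for _ in range(explore_steps):
            action = agent.act(obs)
            obs = self.explore_env.step(action)
            agent.observe(obs)

        # Phase 2: scored test phase. The agent is evaluated by its behavior
        # on derived tasks (e.g. masked-frame prediction, planning,
        # detecting changes to the causal dynamics).
        scores = {}
        for task in self.tasks:
            response = agent.solve(task, self.test_env)
            scores[task.name] = task.score(response)
        return scores
```

The key design point this sketch captures is that the agent never sees a reward during exploration, so any success in phase 2 must come from a learned model of the environment's dynamics rather than reward-driven policy optimization.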
Problem

Research questions and friction points this paper is trying to address.

Evaluating model-learning agents beyond next-frame prediction metrics
Testing world models in different environments with unknown tasks
Assessing agents' ability to learn general environment dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

WorldTest protocol separates reward-free interaction from testing
AutumnBench provides 43 grid-world environments with 129 tasks
Evaluation uses behavior-based scoring across multiple derived tests