AI Summary
Current world models lack a unified, controllable dynamical environment for systematically evaluating whether they understand the underlying rules that govern their dynamics. To address this, we propose SmallWorld, a benchmark specifically designed for controlled dynamical assessment of world models. It spans six distinct classes of dynamical systems and operates under fully observable conditions without requiring hand-crafted reward signals. The framework supports forward modeling and multi-step prediction analysis across mainstream architectures, including recurrent state-space models, Transformers, diffusion models, and neural ODEs. The experiments quantitatively characterize how these models differ in capturing environment structure and how their long-horizon rollouts degrade. These findings provide a reproducible empirical benchmark for representation learning and dynamical modeling, and point to concrete directions for architectural improvement.
Abstract
Current world models lack a unified and controlled setting for systematic evaluation, making it difficult to assess whether they truly capture the underlying rules that govern environment dynamics. In this work, we address this open challenge by introducing the SmallWorld Benchmark, a testbed designed to assess world model capability under isolated and precisely controlled dynamics without relying on handcrafted reward signals. Using this benchmark, we conduct comprehensive experiments in the fully observable state space on representative architectures, including the Recurrent State Space Model, Transformer, diffusion model, and Neural ODE, and examine their behavior across six distinct domains. The results reveal how effectively these models capture environment structure and how their predictions deteriorate over extended rollouts, highlighting both the strengths and limitations of current modeling paradigms and offering insights into future directions for representation learning and dynamics modeling.
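The abstract does not say how SmallWorld scores rollouts. As a rough illustration of why predictions deteriorate over extended rollouts, the sketch below compares one-step (teacher-forced) error with open-loop multi-step error in a fully observable state space. Everything here is an assumption for illustration only: a damped harmonic oscillator stands in for one dynamical domain, a copy with a 5% stiffness error stands in for a learned world model, and all function names are hypothetical, not part of the benchmark.

```python
from typing import Callable, List, Tuple

# Fully observable state: (position, velocity). Hypothetical stand-in domain.
State = Tuple[float, float]
StepFn = Callable[[State], State]


def oscillator_step(state: State, k: float, damping: float = 0.05, dt: float = 0.1) -> State:
    """One semi-implicit (symplectic) Euler step of a damped harmonic oscillator."""
    x, v = state
    v_next = v + dt * (-k * x - damping * v)  # update velocity first
    x_next = x + dt * v_next                  # then position, using the new velocity
    return (x_next, v_next)


def rollout(step_fn: StepFn, s0: State, horizon: int) -> List[State]:
    """Open-loop rollout: each prediction is fed back as the next input."""
    states = [s0]
    for _ in range(horizon):
        states.append(step_fn(states[-1]))
    return states


def sq_err(a: State, b: State) -> float:
    """Squared Euclidean distance between two states."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def open_loop_errors(true_step: StepFn, model_step: StepFn, s0: State, horizon: int) -> List[float]:
    """Per-step error of the model's own rollout against ground truth, steps 1..horizon."""
    truth = rollout(true_step, s0, horizon)
    pred = rollout(model_step, s0, horizon)
    return [sq_err(t, p) for t, p in zip(truth[1:], pred[1:])]


def teacher_forced_errors(true_step: StepFn, model_step: StepFn, s0: State, horizon: int) -> List[float]:
    """One-step error: the model always predicts from the true current state."""
    truth = rollout(true_step, s0, horizon)
    return [sq_err(nxt, model_step(cur)) for cur, nxt in zip(truth[:-1], truth[1:])]


# Ground truth uses stiffness k=1.0; the hypothetical "learned" model is off by 5%.
true_step: StepFn = lambda s: oscillator_step(s, k=1.0)
model_step: StepFn = lambda s: oscillator_step(s, k=1.05)

s0: State = (1.0, 0.0)
H = 100
open_loop = open_loop_errors(true_step, model_step, s0, H)
one_step = teacher_forced_errors(true_step, model_step, s0, H)

for h in (1, 10, 50, 100):
    print(f"h={h:3d}  open-loop={open_loop[h - 1]:.2e}  one-step={one_step[h - 1]:.2e}")
```

In this toy setting the one-step error stays small at every step, while the open-loop error grows because small per-step mistakes are fed back and compound. Reporting both is one common way to separate local accuracy from long-horizon stability, though the benchmark's actual metrics may differ.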