SmallWorlds: Assessing Dynamics Understanding of World Models in Isolated Environments

📅 2025-11-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current world models lack a unified, controlled environment for systematically evaluating whether they capture the rules that govern environment dynamics. To address this, we propose the SmallWorld Benchmark, a testbed for assessing world models under isolated, precisely controlled dynamics. It spans six distinct domains, operates in a fully observable state space, and requires no handcrafted reward signals. Our experiments cover representative architectures: a Recurrent State Space Model, a Transformer, a diffusion model, and a Neural ODE. The results show how effectively each model captures environment structure and how its predictions deteriorate over extended rollouts, highlighting the strengths and limitations of current modeling paradigms and pointing to directions for improvement in representation learning and dynamics modeling.

📝 Abstract

Current world models lack a unified and controlled setting for systematic evaluation, making it difficult to assess whether they truly capture the underlying rules that govern environment dynamics. In this work, we address this open challenge by introducing the SmallWorld Benchmark, a testbed designed to assess world model capability under isolated and precisely controlled dynamics without relying on handcrafted reward signals. Using this benchmark, we conduct comprehensive experiments in the fully observable state space on representative architectures including Recurrent State Space Model, Transformer, Diffusion model, and Neural ODE, examining their behavior across six distinct domains. The experimental results reveal how effectively these models capture environment structure and how their predictions deteriorate over extended rollouts, highlighting both the strengths and limitations of current modeling paradigms and offering insights into future improvement directions in representation learning and dynamics modeling.

Problem

Research questions and friction points this paper is trying to address.

No unified, controlled setting exists for evaluating how well world models understand environment dynamics

Assessing model capability in isolated environments without relying on handcrafted reward signals

Measuring how well models capture environment structure and how their predictions deteriorate over extended rollouts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the SmallWorld Benchmark for controlled evaluation of world models

Tests diverse architectures (Recurrent State Space Model, Transformer, Diffusion model, Neural ODE) in isolated dynamics environments

Analyzes prediction degradation over extended model rollouts
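
To make the idea of rollout degradation concrete, here is a minimal sketch of how per-step error over an autoregressive rollout could be measured. It is not the benchmark's code: the damped oscillator, the deliberately mis-specified "learned" model, and all function names and parameters below are hypothetical stand-ins chosen for illustration.

```python
# Hypothetical illustration: per-step rollout error on a toy system.
# Neither the environment nor the model comes from the SmallWorld paper.

def true_step(state, dt=0.05, k=1.0, c=0.1):
    """Advance a damped oscillator one step (semi-implicit Euler)."""
    x, v = state
    a = -k * x - c * v          # acceleration from spring and damping
    v_new = v + dt * a
    x_new = x + dt * v_new
    return (x_new, v_new)

def model_step(state, dt=0.05, k=1.05, c=0.1):
    """Stand-in for a learned model: same form, slightly wrong stiffness."""
    return true_step(state, dt=dt, k=k, c=c)

def rollout(step_fn, state, horizon):
    """Feed each predicted state back in as the next input."""
    trajectory = []
    for _ in range(horizon):
        state = step_fn(state)
        trajectory.append(state)
    return trajectory

def rollout_errors(initial_states, horizon):
    """Squared error at each rollout step, averaged over initial states."""
    errors = [0.0] * horizon
    for s0 in initial_states:
        truth = rollout(true_step, s0, horizon)
        pred = rollout(model_step, s0, horizon)
        for t, (a, b) in enumerate(zip(truth, pred)):
            errors[t] += sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [e / len(initial_states) for e in errors]

initial_states = [(1.0, 0.0), (0.0, 1.0), (-0.5, 0.5)]
errors = rollout_errors(initial_states, horizon=50)
print(f"step 1 error: {errors[0]:.2e}, step 50 error: {errors[-1]:.2e}")
```

Because the state is fully observed and no reward is involved, the error curve depends only on how well the one-step model matches the true dynamics. A small one-step mismatch compounds as predictions are fed back in, which is the kind of long-horizon deterioration the benchmark is described as analyzing.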
