Dyn-O: Building Structured World Models with Object-Centric Representations

📅 2025-07-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing object-centric world models have mostly been validated in simple geometric environments and struggle to generalize to visually complex scenes with rich textures and cluttered layouts. This paper proposes Dyn-O, a structured world model that decouples object representations into dynamics-agnostic and dynamics-aware components. It integrates object-centric learning into an end-to-end pixel-input architecture, jointly optimizing contrastive and reconstruction losses to enable fine-grained manipulation of representations and the generation of diverse imagined trajectories. Notably, it is the first decomposable, object-level world model learned from raw pixels on the Procgen benchmark, and it outperforms DreamerV3 in rollout prediction accuracy, demonstrating stronger modeling fidelity and generalization in visually complex, procedurally generated environments.
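The decoupling idea above can be illustrated with a minimal sketch: each object slot's feature vector is split into a dynamics-agnostic part (e.g. appearance), which passes through time unchanged, and a dynamics-aware part (e.g. position), which the dynamics model updates. The split sizes, the linear dynamics, and all variable names here are hypothetical illustrations, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

N_SLOTS, D = 4, 8   # hypothetical: 4 object slots, 8-dim slot features
D_AG = D // 2       # first half: dynamics-agnostic; second half: dynamics-aware

def split_slots(slots):
    """Decouple each slot feature into agnostic / aware parts."""
    return slots[:, :D_AG], slots[:, D_AG:]

def step_dynamics(aware, action, W):
    """Toy linear dynamics applied only to the dynamics-aware part."""
    return aware @ W + action

slots_t = rng.normal(size=(N_SLOTS, D))
agnostic, aware = split_slots(slots_t)

W = rng.normal(scale=0.1, size=(D - D_AG, D - D_AG))
action = rng.normal(size=(D - D_AG,))
aware_next = step_dynamics(aware, action, W)

# Recombine: agnostic features are carried over unchanged across time.
slots_next = np.concatenate([agnostic, aware_next], axis=1)
```

Because only the dynamics-aware half is rolled forward, the agnostic half can be swapped or edited independently, which is what enables the fine-grained manipulation and diverse imagined trajectories described above.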

📝 Abstract
World models aim to capture the dynamics of the environment, enabling agents to predict and plan for future states. In most scenarios of interest, the dynamics are highly centered on interactions among objects within the environment. This motivates the development of world models that operate on object-centric rather than monolithic representations, with the goal of more effectively capturing environment dynamics and enhancing compositional generalization. However, the development of object-centric world models has largely been explored in environments with limited visual complexity (such as basic geometries). It remains underexplored whether such models can generalize to more complex settings with diverse textures and cluttered scenes. In this paper, we fill this gap by introducing Dyn-O, an enhanced structured world model built upon object-centric representations. Compared to prior work in object-centric representations, Dyn-O improves in both learning representations and modeling dynamics. On the challenging Procgen games, we find that our method can learn object-centric world models directly from pixel observations, outperforming DreamerV3 in rollout prediction accuracy. Furthermore, by decoupling object-centric features into dynamics-agnostic and dynamics-aware components, we enable finer-grained manipulation of these features and generate more diverse imagined trajectories.
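The abstract mentions jointly optimizing contrastive and reconstruction objectives on the learned features. A minimal sketch of such a combined loss is below, using a standard InfoNCE-style contrastive term plus a mean-squared reconstruction term; the linear decoder, the 0.5 weighting, and all shapes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
B, D = 16, 8                                  # hypothetical batch of slot features
z = rng.normal(size=(B, D))                   # encoded features
z_pos = z + 0.05 * rng.normal(size=(B, D))    # augmented positive views
x = rng.normal(size=(B, 32))                  # flattened observations
W_dec = rng.normal(scale=0.1, size=(D, 32))   # toy linear decoder

def info_nce(z, z_pos, tau=0.1):
    """InfoNCE: each z_i should match z_pos_i against other batch entries."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    logits = z @ z_pos.T / tau
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # positives on the diagonal

def recon_loss(z, x, W):
    """Mean-squared reconstruction from a linear decoder."""
    return np.mean((z @ W - x) ** 2)

loss = recon_loss(z, x, W_dec) + 0.5 * info_nce(z, z_pos)
```

The reconstruction term anchors the features to pixel content, while the contrastive term keeps matching views close and distinct objects apart, which is one common way such joint objectives are combined.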
Problem

Research questions and friction points this paper is trying to address.

Developing object-centric world models for complex environments
Enhancing representation learning and dynamics modeling in Dyn-O
Improving prediction accuracy and trajectory diversity in Procgen games
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-centric world model for complex environments
Decoupling features into dynamics-agnostic and aware components
Improved representation learning and dynamics modeling
Authors
Zizhao Wang — UT Austin
Kaixin Wang — Microsoft Research Asia
Li Zhao — Microsoft Research Asia
Peter Stone — University of Texas at Austin, Sony AI
Jiang Bian — Microsoft Research Asia