When Object-Centric World Models Meet Policy Learning: From Pixels to Policies, and Where It Breaks

📅 2025-11-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether object-centric world models (OCWMs) can improve the generalization and sample efficiency of reinforcement learning (RL) policies under novel feature compositions via explicitly disentangled object representations. To test this hypothesis, the authors introduce DLPWM, a fully unsupervised disentangled object-centric world model comprising a disentangled encoder, a dynamic object modeling module, and a predictive network that jointly learn object-level latent variables directly from pixels. DLPWM proves robust and generalizes well in reconstruction and prediction tasks, yet it underperforms DreamerV3 in downstream control; latent-trajectory analyses trace the gap to representation shifts induced by multi-object interactions, which destabilize policy training. This constitutes the first systematic evidence that representation stability, not merely disentanglement or expressiveness, is the critical bottleneck limiting OCWM-driven policy learning.
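To make the three named components concrete, here is a minimal sketch in PyTorch. It is not the authors' implementation: every module name, layer size, and the attention-based interaction step are assumptions chosen only to show how a per-object encoder, an object dynamics module, and a next-latent prediction step can compose.

```python
# Hypothetical sketch of an object-centric world model's pieces.
# Names (ObjectEncoder, ObjectDynamics) and sizes are assumptions, not DLPWM's.
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    """Encodes an image into K object-level latent vectors (slot-style)."""
    def __init__(self, num_objects=4, latent_dim=32):
        super().__init__()
        self.num_objects, self.latent_dim = num_objects, latent_dim
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_slots = nn.Linear(64, num_objects * latent_dim)

    def forward(self, image):                      # image: (B, 3, H, W)
        feats = self.backbone(image)               # (B, 64)
        slots = self.to_slots(feats)               # (B, K * D)
        return slots.view(-1, self.num_objects, self.latent_dim)

class ObjectDynamics(nn.Module):
    """Predicts next object latents; self-attention models interactions."""
    def __init__(self, latent_dim=32, action_dim=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)
        self.step = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, slots, action):              # slots: (B, K, D)
        mixed, _ = self.attn(slots, slots, slots)  # object-object interactions
        act = action.unsqueeze(1).expand(-1, slots.size(1), -1)
        return slots + self.step(torch.cat([mixed, act], dim=-1))  # residual

# Usage on random tensors, just to show the shapes compose.
enc, dyn = ObjectEncoder(), ObjectDynamics()
obs, act = torch.randn(8, 3, 64, 64), torch.randn(8, 4)
z_t = enc(obs)           # (8, 4, 32) object latents
z_next = dyn(z_t, act)   # predicted next-step latents, same shape
print(z_t.shape, z_next.shape)
```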

📝 Abstract
Object-centric world models (OCWM) aim to decompose visual scenes into object-level representations, providing structured abstractions that could improve compositional generalization and data efficiency in reinforcement learning. We hypothesize that explicitly disentangled object-level representations, by localizing task-relevant information, can enhance policy performance across novel feature combinations. To test this hypothesis, we introduce DLPWM, a fully unsupervised, disentangled object-centric world model that learns object-level latents directly from pixels. DLPWM achieves strong reconstruction and prediction performance, including robustness to several out-of-distribution (OOD) visual variations. However, when used for downstream model-based control, policies trained on DLPWM latents underperform compared to DreamerV3. Through latent-trajectory analyses, we identify representation shift during multi-object interactions as a key driver of unstable policy learning. Our results suggest that, although object-centric perception supports robust visual modeling, achieving stable control requires mitigating latent drift.
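The latent-trajectory analysis the abstract refers to can be pictured with a small, self-contained sketch. The drift metric below (per-step L2 displacement of object latents, flagged against a mean-plus-z-sigma threshold) is one plausible instantiation we assume for illustration, not the paper's published procedure.

```python
# Assumed latent-drift diagnostic; thresholding scheme is our choice.
import torch

def latent_drift(latents):
    """latents: (T, K, D) object latents over a T-step trajectory.
    Returns per-step L2 displacement, shape (T-1, K)."""
    return (latents[1:] - latents[:-1]).norm(dim=-1)

def flag_shifts(latents, z=3.0):
    """Flag steps whose drift exceeds mean + z * std, per object slot."""
    d = latent_drift(latents)                         # (T-1, K)
    thresh = d.mean(0, keepdim=True) + z * d.std(0, keepdim=True)
    return d > thresh                                 # boolean shift events

# Synthetic trajectory with an injected "interaction" jump at t = 50.
traj = torch.randn(100, 4, 32).cumsum(0) * 0.01
traj[50:] += 2.0                                      # sudden representation shift
print(flag_shifts(traj).nonzero()[:5])                # detects the jump step
```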
Problem

Research questions and friction points this paper is trying to address.

Can object-centric world models, which decompose scenes into object-level representations, improve generalization and sample efficiency in RL?
Why does DLPWM, which learns object latents directly from pixels, yield unstable downstream policy learning despite strong prediction?
How does representation shift (latent drift) during multi-object interactions destabilize policy control?
Innovation

Methods, ideas, or system contributions that make the work stand out.

DLPWM, a fully unsupervised object-centric world model learned directly from pixels
Disentangled object-level latents that give robust visual reconstruction and prediction, including under OOD variations
Diagnosis of latent drift as the bottleneck to stable policy control, motivating drift mitigation (see the sketch after this list)
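The paper identifies drift mitigation as the open problem rather than solving it. As one illustration of what a mitigation could look like, the sketch below smooths the policy's latent input with an exponential moving average so that brief interaction-induced shifts perturb the control input less. This is purely hypothetical and not DLPWM's method.

```python
# Hypothetical drift mitigation: EMA-smoothed latents for the policy input.
import torch

class SmoothedLatent:
    """Exponential moving average over per-step object latents (K, D)."""
    def __init__(self, alpha=0.9):
        self.alpha = alpha
        self.state = None

    def update(self, z):
        if self.state is None:
            self.state = z
        else:
            self.state = self.alpha * self.state + (1 - self.alpha) * z
        return self.state

smoother = SmoothedLatent(alpha=0.9)
for t in range(5):
    z_t = torch.randn(4, 32)          # latents from the world model at step t
    z_policy = smoother.update(z_t)   # smoothed input handed to the policy
```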
👥 Authors
Stefano Ferraro
IDLab, Ghent University, Ghent, Belgium
Akihiro Nakano
Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
Masahiro Suzuki
The University of Tokyo
Yutaka Matsuo
Graduate School of Engineering, The University of Tokyo, Tokyo, Japan