🤖 AI Summary
This work investigates whether object-centric world models (OCWMs) can improve the generalization and sample efficiency of reinforcement learning (RL) policies under novel feature compositions via explicitly disentangled object representations. To test this, the authors propose DLPWM, a fully unsupervised disentangled object-centric world model comprising a disentangled encoder, a dynamic object modeling module, and a predictive network that jointly learn object-level latent variables directly from pixels. DLPWM shows strong robustness and generalization in reconstruction and prediction, including under out-of-distribution visual variations; however, policies trained on its latents underperform DreamerV3 in downstream control. Latent-trajectory analyses trace this failure to representation shifts that arise during multi-object interactions and destabilize policy training. The results provide systematic evidence that representation stability, not merely disentanglement or expressiveness, is the critical bottleneck limiting OCWM-driven policy learning.
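To make the three-component architecture described above concrete, the sketch below shows one way a disentangled encoder, object dynamics module, and pixel predictor could fit together in PyTorch. It is a minimal illustration under assumed design choices, not the paper's implementation: every class name (`DisentangledEncoder`, `ObjectDynamics`, `DLPWMSketch`), the slot count, latent dimensions, and wiring are hypothetical.

```python
# Minimal sketch of the encoder -> object dynamics -> predictor structure.
# All names, shapes, and wiring are illustrative assumptions, not the
# paper's actual DLPWM implementation.
import torch
import torch.nn as nn


class DisentangledEncoder(nn.Module):
    """Maps pixels to K object-level latent vectors (hypothetical layout)."""

    def __init__(self, num_slots=4, slot_dim=32):
        super().__init__()
        self.num_slots, self.slot_dim = num_slots, slot_dim
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_slots * slot_dim),
        )

    def forward(self, obs):  # obs: (B, 3, H, W)
        z = self.backbone(obs)
        return z.view(-1, self.num_slots, self.slot_dim)  # (B, K, D)


class ObjectDynamics(nn.Module):
    """Predicts next-step object latents from current latents and action."""

    def __init__(self, slot_dim=32, action_dim=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(slot_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, slot_dim),
        )

    def forward(self, slots, action):  # slots: (B, K, D), action: (B, A)
        a = action.unsqueeze(1).expand(-1, slots.size(1), -1)
        # Residual update keeps each slot's latent near its previous value.
        return slots + self.mlp(torch.cat([slots, a], dim=-1))


class DLPWMSketch(nn.Module):
    """Encoder -> per-object dynamics -> pixel predictor (illustrative)."""

    def __init__(self, num_slots=4, slot_dim=32, action_dim=4):
        super().__init__()
        self.encoder = DisentangledEncoder(num_slots, slot_dim)
        self.dynamics = ObjectDynamics(slot_dim, action_dim)
        self.predictor = nn.Sequential(  # decodes flattened slots to pixels
            nn.Linear(num_slots * slot_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, obs, action):
        slots = self.encoder(obs)                   # object-level latents
        next_slots = self.dynamics(slots, action)   # predicted next latents
        recon = self.predictor(next_slots.flatten(1))  # predicted next frame
        return slots, next_slots, recon


# Toy forward pass on a random 32x32 frame and 4-dim action.
model = DLPWMSketch()
obs, action = torch.randn(2, 3, 32, 32), torch.randn(2, 4)
slots, next_slots, recon = model(obs, action)
print(slots.shape, next_slots.shape, recon.shape)  # (2,4,32) (2,4,32) (2,3,32,32)
```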
📝 Abstract
Object-centric world models (OCWMs) aim to decompose visual scenes into object-level representations, providing structured abstractions that could improve compositional generalization and data efficiency in reinforcement learning. We hypothesize that explicitly disentangled object-level representations, by localizing task-relevant information, can enhance policy performance across novel feature combinations. To test this hypothesis, we introduce DLPWM, a fully unsupervised, disentangled object-centric world model that learns object-level latents directly from pixels. DLPWM achieves strong reconstruction and prediction performance, including robustness to several out-of-distribution (OOD) visual variations. However, when used for downstream model-based control, policies trained on DLPWM latents underperform those learned with DreamerV3. Through latent-trajectory analyses, we identify representation shift during multi-object interactions as a key driver of unstable policy learning. Our results suggest that, although object-centric perception supports robust visual modeling, achieving stable control requires mitigating latent drift.
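The abstract does not detail the latent-trajectory analysis. As a hypothetical sketch of the kind of diagnostic it implies, one could quantify representation shift as the step-to-step displacement of each object latent along an encoded rollout; the function name, shapes, and toy data below are assumptions for illustration.

```python
# Hypothetical latent-drift diagnostic: measures step-to-step movement of
# each object latent along a trajectory. Spikes during multi-object
# interactions would correspond to the representation shift described above.
import torch


def latent_drift(slot_trajectory):
    """slot_trajectory: (T, K, D) tensor of K object latents over T steps.

    Returns a (T-1, K) tensor of per-step, per-object L2 displacements.
    """
    return (slot_trajectory[1:] - slot_trajectory[:-1]).norm(dim=-1)


# Toy usage: a random walk standing in for an encoded rollout.
traj = torch.randn(100, 4, 32).cumsum(dim=0) * 0.01
drift = latent_drift(traj)
print(drift.mean(dim=0))  # average drift per object slot
```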