🤖 AI Summary
This work investigates whether object-centric world models (OCWMs) can improve the generalization and sample efficiency of reinforcement learning (RL) policies under novel feature compositions via explicitly disentangled object representations. To test this, the authors propose DLPWM, a fully unsupervised disentangled object-centric world model comprising a disentangled encoder, a dynamic object modeling module, and a predictive network that jointly learn object-level latent variables directly from pixels. DLPWM shows strong robustness and generalization in reconstruction and prediction, including under out-of-distribution visual variations; however, policies trained on its latents underperform DreamerV3 in downstream control. Latent-trajectory analyses trace this failure to representation shifts that arise during multi-object interactions and destabilize policy training. The results provide systematic evidence that representation stability, not merely disentanglement or expressiveness, is the critical bottleneck limiting OCWM-driven policy learning.
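To make the three-component architecture described above concrete, the sketch below shows one way a disentangled encoder, object dynamics module, and pixel predictor could fit together in PyTorch. It is a minimal illustration under assumed design choices, not the paper's implementation: every class name (`DisentangledEncoder`, `ObjectDynamics`, `DLPWMSketch`), the slot count, latent dimensions, and wiring are hypothetical.

```python
# Minimal sketch of the encoder -> object dynamics -> predictor structure.
# All names, shapes, and wiring are illustrative assumptions, not the
# paper's actual DLPWM implementation.
import torch
import torch.nn as nn


class DisentangledEncoder(nn.Module):
    """Maps pixels to K object-level latent vectors (hypothetical layout)."""

    def __init__(self, num_slots=4, slot_dim=32):
        super().__init__()
        self.num_slots, self.slot_dim = num_slots, slot_dim
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_slots * slot_dim),
        )

    def forward(self, obs):  # obs: (B, 3, H, W)
        z = self.backbone(obs)
        return z.view(-1, self.num_slots, self.slot_dim)  # (B, K, D)


class ObjectDynamics(nn.Module):
    """Predicts next-step object latents from current latents and action."""

    def __init__(self, slot_dim=32, action_dim=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(slot_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, slot_dim),
        )

    def forward(self, slots, action):  # slots: (B, K, D), action: (B, A)
        a = action.unsqueeze(1).expand(-1, slots.size(1), -1)
        # Residual update keeps each slot's latent near its previous value.
        return slots + self.mlp(torch.cat([slots, a], dim=-1))


class DLPWMSketch(nn.Module):
    """Encoder -> per-object dynamics -> pixel predictor (illustrative)."""

    def __init__(self, num_slots=4, slot_dim=32, action_dim=4):
        super().__init__()
        self.encoder = DisentangledEncoder(num_slots, slot_dim)
        self.dynamics = ObjectDynamics(slot_dim, action_dim)
        self.predictor = nn.Sequential(  # decodes flattened slots to pixels
            nn.Linear(num_slots * slot_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, obs, action):
        slots = self.encoder(obs)                   # object-level latents
        next_slots = self.dynamics(slots, action)   # predicted next latents
        recon = self.predictor(next_slots.flatten(1))  # predicted next frame
        return slots, next_slots, recon


# Toy forward pass on a random 32x32 frame and 4-dim action.
model = DLPWMSketch()
obs, action = torch.randn(2, 3, 32, 32), torch.randn(2, 4)
slots, next_slots, recon = model(obs, action)
print(slots.shape, next_slots.shape, recon.shape)  # (2,4,32) (2,4,32) (2,3,32,32)
```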
📝 Abstract
Object-centric world models (OCWMs) aim to decompose visual scenes into object-level representations, providing structured abstractions that could improve compositional generalization and data efficiency in reinforcement learning. We hypothesize that explicitly disentangled object-level representations, by localizing task-relevant information, can enhance policy performance across novel feature combinations. To test this hypothesis, we introduce DLPWM, a fully unsupervised, disentangled object-centric world model that learns object-level latents directly from pixels. DLPWM achieves strong reconstruction and prediction performance, including robustness to several out-of-distribution (OOD) visual variations. However, when used for downstream model-based control, policies trained on DLPWM latents underperform those learned with DreamerV3. Through latent-trajectory analyses, we identify representation shift during multi-object interactions as a key driver of unstable policy learning. Our results suggest that, although object-centric perception supports robust visual modeling, achieving stable control requires mitigating latent drift.
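The abstract does not detail the latent-trajectory analysis. As a hypothetical sketch of the kind of diagnostic it implies, one could quantify representation shift as the step-to-step displacement of each object latent along an encoded rollout; the function name, shapes, and toy data below are assumptions for illustration.

```python
# Hypothetical latent-drift diagnostic: measures step-to-step movement of
# each object latent along a trajectory. Spikes during multi-object
# interactions would correspond to the representation shift described above.
import torch


def latent_drift(slot_trajectory):
    """slot_trajectory: (T, K, D) tensor of K object latents over T steps.

    Returns a (T-1, K) tensor of per-step, per-object L2 displacements.
    """
    return (slot_trajectory[1:] - slot_trajectory[:-1]).norm(dim=-1)


# Toy usage: a random walk standing in for an encoded rollout.
traj = torch.randn(100, 4, 32).cumsum(dim=0) * 0.01
drift = latent_drift(traj)
print(drift.mean(dim=0))  # average drift per object slot
```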