🤖 AI Summary
Existing object-centric reinforcement learning approaches factor state by individual objects but leave object interactions implicit, limiting policy robustness and transferability. This paper proposes the Factored Interactive Object-Centric World Model (FIOC-WM), a disentangled and modular framework that, for the first time, explicitly models inter-object interaction structure within a world model, decomposing tasks into composable interaction primitives to support hierarchical policy learning. FIOC-WM operates directly on pixel inputs, integrating a pre-trained vision encoder with a hierarchical RL architecture to jointly learn object-centric representations and an interaction graph. Evaluated on simulated robotics and embodied-AI benchmarks, FIOC-WM achieves significant improvements in sample efficiency (+37% on average) and cross-task generalization, demonstrating that explicit interaction modeling is critical for robust control.
📝 Abstract
Agents that understand objects and their interactions can learn policies that are more robust and transferable. However, most object-centric RL methods factor state by individual objects while leaving interactions implicit. We introduce the Factored Interactive Object-Centric World Model (FIOC-WM), a unified framework that learns structured representations of both objects and their interactions within a world model. FIOC-WM captures environment dynamics with disentangled and modular representations of object interactions, improving sample efficiency and generalization for policy learning. Concretely, FIOC-WM first learns object-centric latents and an interaction structure directly from pixels, leveraging pre-trained vision encoders. The learned world model then decomposes tasks into composable interaction primitives, and a hierarchical policy is trained on top: a high level selects the type and order of interactions, while a low level executes them. On simulated robotic and embodied-AI benchmarks, FIOC-WM improves policy-learning sample efficiency and generalization over world-model baselines, indicating that explicit, modular interaction learning is crucial for robust control.
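The factored structure the abstract describes — per-object latents, an explicit interaction graph inside the world model, and a two-level policy where the high level picks interaction primitives and the low level executes them — can be illustrated with a minimal sketch. All names here (`FactoredState`, `InteractionWorldModel`, the edge-based dynamics) are illustrative assumptions for exposition, not the paper's actual API or learned model:

```python
from dataclasses import dataclass

# Hypothetical sketch of a factored, interaction-centric world model.
# Object latents are kept in separate slots, and dynamics propagate
# information only along edges of an (assumed already-learned) graph.

@dataclass
class FactoredState:
    """Object-centric latent state: one latent vector per object slot."""
    slots: dict  # object name -> list of floats

class InteractionWorldModel:
    """Predicts next latents; effects flow only along interaction edges."""
    def __init__(self, edges):
        self.edges = set(edges)  # directed (source, target) object pairs

    def step(self, state, primitive):
        """Apply one interaction primitive (an edge) to the factored state."""
        src, dst = primitive
        next_slots = {k: list(v) for k, v in state.slots.items()}
        if (src, dst) in self.edges:
            # Toy local dynamics: the target slot absorbs the source latent.
            next_slots[dst] = [a + b for a, b in zip(state.slots[dst],
                                                     state.slots[src])]
        return FactoredState(next_slots)

class HighLevelPolicy:
    """Selects the type and order of interactions (here, a fixed plan)."""
    def __init__(self, plan):
        self.plan = list(plan)

    def select(self, t):
        return self.plan[t % len(self.plan)]

class LowLevelPolicy:
    """Executes a single primitive by rolling the world model forward."""
    def execute(self, model, state, primitive):
        return model.step(state, primitive)

def rollout(model, high, low, state, horizon):
    """Hierarchical rollout: high level chooses, low level executes."""
    for t in range(horizon):
        state = low.execute(model, state, high.select(t))
    return state
```

Because each primitive touches only the slots on its edge, primitives compose: a new task is a new high-level ordering over the same low-level executors, which is the intuition behind the claimed transfer benefits.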