Learning Interactive World Model for Object-Centric Reinforcement Learning

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing object-centric reinforcement learning approaches model object interactions only implicitly, limiting policy robustness and transferability. This paper proposes the Factored Interactive Object-Centric World Model (FIOC-WM), a decoupled, modular framework that explicitly models inter-object interaction structure, decomposing tasks into composable interaction primitives to support hierarchical policy learning. FIOC-WM operates directly on pixel inputs, integrating a pre-trained vision encoder with a hierarchical RL architecture to jointly learn object-centric representations and an interaction graph. Evaluated on simulated robotics and embodied-AI benchmarks, FIOC-WM achieves significant improvements in sample efficiency (+37% on average) and cross-task generalization, demonstrating that explicit interaction modeling is critical for robust control.

📝 Abstract
Agents that understand objects and their interactions can learn policies that are more robust and transferable. However, most object-centric RL methods factor state by individual objects while leaving interactions implicit. We introduce the Factored Interactive Object-Centric World Model (FIOC-WM), a unified framework that learns structured representations of both objects and their interactions within a world model. FIOC-WM captures environment dynamics with disentangled and modular representations of object interactions, improving sample efficiency and generalization for policy learning. Concretely, FIOC-WM first learns object-centric latents and an interaction structure directly from pixels, leveraging pre-trained vision encoders. The learned world model then decomposes tasks into composable interaction primitives, and a hierarchical policy is trained on top: a high level selects the type and order of interactions, while a low level executes them. On simulated robotic and embodied-AI benchmarks, FIOC-WM improves policy-learning sample efficiency and generalization over world-model baselines, indicating that explicit, modular interaction learning is crucial for robust control.
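The abstract describes a world model that factors dynamics by object while representing interactions explicitly via a learned graph. A minimal sketch of such a factored transition, assuming each object's next latent depends on its own latent plus aggregated messages from the objects it interacts with (all shapes, names, and the update rule here are illustrative, not the paper's actual model):

```python
import numpy as np

# Hedged sketch of a factored, interaction-aware transition model.
# A[i, j] = 1 means object j influences object i; all parameters are toy values.

rng = np.random.default_rng(0)
n_objects, d = 3, 4

def factored_step(z: np.ndarray, A: np.ndarray,
                  W_self: np.ndarray, W_msg: np.ndarray) -> np.ndarray:
    """z: (n_objects, d) object latents; A: (n, n) binary interaction graph.
    Each object's update combines its own latent with messages from neighbors."""
    messages = A @ z                      # aggregate latents of interacting objects
    return np.tanh(z @ W_self + messages @ W_msg)

z = rng.normal(size=(n_objects, d))
A = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]], dtype=float)    # objects 0 and 1 interact; object 2 is isolated
W_self = 0.1 * rng.normal(size=(d, d))
W_msg = 0.1 * rng.normal(size=(d, d))
z_next = factored_step(z, A, W_self, W_msg)
print(z_next.shape)
```

Because object 2's row in `A` is all zeros, its update depends only on its own latent: modularity of this kind is what makes the learned dynamics composable across tasks.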
Problem

Research questions and friction points this paper is trying to address.

Learning explicit object interactions from visual inputs for reinforcement learning
Improving sample efficiency and generalization in object-centric world models
Developing modular interaction representations for robust policy learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns object-centric latents from pixels
Decomposes tasks into composable interaction primitives
Uses hierarchical policy for interaction selection and execution
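The hierarchical structure listed above (a high level choosing the type and order of interactions, a low level executing them) can be sketched as follows. This is a hypothetical illustration; the classes, the fixed plan, and the symbolic state are assumptions for clarity, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Primitive:
    """A composable interaction primitive, e.g. grasp or place."""
    name: str
    execute: Callable[[dict], dict]  # low-level controller: state -> next state

class HighLevelPolicy:
    """Selects the type and order of interactions for a task."""
    def __init__(self, plan: List[str]):
        self.plan = plan  # fixed for illustration; learned in practice

    def next_primitive(self, step: int) -> str:
        return self.plan[step]

def run_episode(state: dict, policy: HighLevelPolicy,
                primitives: Dict[str, Primitive]) -> dict:
    """High level picks primitives in order; low level executes each one."""
    for step in range(len(policy.plan)):
        name = policy.next_primitive(step)
        state = primitives[name].execute(state)
    return state

# Toy primitives acting on a symbolic state
primitives = {
    "grasp": Primitive("grasp", lambda s: {**s, "holding": True}),
    "place": Primitive("place", lambda s: {**s, "holding": False, "placed": True}),
}
policy = HighLevelPolicy(plan=["grasp", "place"])
final = run_episode({"holding": False, "placed": False}, policy, primitives)
print(final)  # {'holding': False, 'placed': True}
```

The design point is that primitives are reusable across tasks: a new task only requires the high level to output a different ordering, which is what the summary credits for the generalization gains.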