🤖 AI Summary
Existing world models struggle with high-dimensional, non-stationary environments that have strong multi-object interactions, because they rely on holistic scene representations rather than object-level decomposition. To address this, we propose STICA, a Slot Transformer-based framework that decomposes visual observations into object-centric token sequences and adds action and reward tokens to the sequence. STICA also introduces causal attention (its first application in world modeling), which enables token-level causal reasoning and thereby enhances the interpretability and decision efficiency of the downstream policy and value networks. Experiments on complex multi-object interaction tasks demonstrate that STICA significantly outperforms state-of-the-art methods, with substantial improvements in both sample efficiency and asymptotic performance.
📝 Abstract
World models have been developed to support sample-efficient deep reinforcement learning agents. However, it remains challenging for world models to accurately replicate environments that are high-dimensional, non-stationary, and composed of multiple objects with rich interactions, since most world models learn holistic representations of all environmental components. By contrast, humans perceive the environment by decomposing it into discrete objects, facilitating efficient decision-making. Motivated by this insight, we propose Slot Transformer Imagination with CAusality-aware reinforcement learning (STICA), a unified framework in which object-centric Transformers serve as the world model and causality-aware policy and value networks. STICA represents each observation as a set of object-centric tokens, together with tokens for the agent action and the resulting reward, enabling the world model to predict token-level dynamics and interactions. The policy and value networks then estimate token-level cause-effect relations and use them in the attention layers, yielding causality-guided decision-making. Experiments on object-rich benchmarks demonstrate that STICA consistently outperforms state-of-the-art agents in both sample efficiency and final performance.
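The token layout and causality-guided attention described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_token_sequence` and `masked_attention` are hypothetical, the cause-effect mask is written by hand here (the abstract does not say how STICA estimates it), and learned projections, multiple heads, and slot extraction are all omitted. The sketch only shows how a binary cause-effect mask could restrict which tokens influence each other inside an attention layer.

```python
import math

def build_token_sequence(slots, action, reward):
    """Concatenate K object-centric tokens with one action token and one reward token."""
    return [list(s) for s in slots] + [list(action), list(reward)]

def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention restricted by a binary mask.

    mask[i][j] == 1 means token j is treated as a possible cause of token i
    and may be attended to; mask[i][j] == 0 blocks it.
    Assumption: every row of the mask allows at least one token
    (for example, the token itself), so the softmax is well defined.
    """
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) if mask[i][j] else -math.inf
            for j, k in enumerate(keys)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # blocked tokens get weight 0
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return outputs

# Two object tokens, one action token, one reward token (toy 2-dimensional embeddings).
tokens = build_token_sequence([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], [0.5, 0.5])

# Hand-written, purely illustrative cause-effect mask:
# each token attends to itself, and the reward token also attends to object 0.
mask = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
]
attended = masked_attention(tokens, tokens, tokens, mask)
```

In this toy setup, a token that may only attend to itself is passed through unchanged, while the reward token becomes a mixture of itself and the object it is assumed to depend on.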