🤖 AI Summary
Model-based reinforcement learning (MBRL) suffers from low sample efficiency in visually complex environments because conventional pixel-level world models fail to capture small, dynamic, decision-critical elements. To address this, we propose OC-STORM, an object-centric world model for MBRL. Its core idea is a deep integration of object-aware perception into MBRL: key objects are localized with segmentation masks; a frozen, pre-trained foundation vision model extracts robust object features; and these features are combined with raw observations to model environment dynamics. Policy optimization then leverages object-augmented imagined trajectories within a STORM-style rollout framework. Evaluated on visually complex domains, including Atari games and *Hollow Knight*, OC-STORM achieves substantial gains in both sample efficiency and policy performance. These results support the claim that object-aware world modeling better captures decision-critical dynamics.
📝 Abstract
Deep reinforcement learning has achieved remarkable success in learning control policies from pixels across a wide range of tasks, yet its application remains hindered by low sample efficiency, requiring significantly more environment interactions than humans need to reach comparable performance. Model-based reinforcement learning (MBRL) offers a solution by leveraging learned world models to generate simulated experience, thereby improving sample efficiency. However, in visually complex environments, small or dynamic elements can be critical for decision-making, yet traditional MBRL methods in pixel-based environments typically rely on auto-encoding with an $L_2$ loss, which is dominated by large image regions and often fails to capture these decision-relevant details. To address this limitation, we propose an object-centric MBRL pipeline, which integrates recent advances in computer vision to allow agents to focus on key decision-related elements. Our approach consists of four main steps: (1) annotating key objects related to rewards and goals with segmentation masks, (2) extracting object features using a pre-trained, frozen foundation vision model, (3) incorporating these object features with the raw observations to predict environmental dynamics, and (4) training the policy using imagined trajectories generated by this object-centric world model. Building on the efficient MBRL algorithm STORM, we call this pipeline OC-STORM. We demonstrate OC-STORM's practical value in overcoming the limitations of conventional MBRL approaches on both Atari games and the visually complex game Hollow Knight.
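To make the four-step pipeline concrete, here is a minimal, hypothetical sketch of how object features might be pooled from segmentation masks and fed, alongside the observation latent, into a dynamics model for imagined rollouts. All names (`extract_object_features`, `ToyObjectCentricWorldModel`, and the masked average-pooling stand-in for the frozen foundation model) are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def extract_object_features(frame, masks, embed_dim=8):
    """Step 2 (simplified): pool a feature vector per annotated object.

    A real pipeline would run a frozen, pre-trained foundation vision
    model; here masked average pooling over raw pixels stands in.
    """
    feats = []
    for mask in masks:                 # one binary mask per key object (step 1)
        pixels = frame[mask]           # (n_pixels, channels)
        pooled = pixels.mean(axis=0)   # crude per-object descriptor
        feats.append(np.resize(pooled, embed_dim))
    return np.concatenate(feats)       # (n_objects * embed_dim,)

class ToyObjectCentricWorldModel:
    """Step 3 (simplified): predict the next latent state from the
    current latent, the object features, and the action."""

    def __init__(self, latent_dim, feat_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = latent_dim + feat_dim + action_dim
        self.W = rng.normal(scale=0.1, size=(latent_dim, in_dim))

    def step(self, latent, obj_feats, action):
        x = np.concatenate([latent, obj_feats, action])
        return np.tanh(self.W @ x)     # next imagined latent

# --- tiny usage example -------------------------------------------------
frame = np.random.default_rng(1).random((16, 16, 3))
masks = [np.zeros((16, 16), dtype=bool) for _ in range(2)]
masks[0][2:5, 2:5] = True              # pretend these regions are the
masks[1][10:14, 10:14] = True          # annotated key objects

feats = extract_object_features(frame, masks)
wm = ToyObjectCentricWorldModel(latent_dim=4, feat_dim=feats.size, action_dim=3)
latent = np.zeros(4)
for _ in range(5):                     # imagined rollout; in step 4 the
    latent = wm.step(latent, feats,    # policy would be trained on these
                     action=np.array([1.0, 0.0, 0.0]))
print(latent.shape)
```

In the actual method the dynamics model is a learned sequence model (STORM uses a transformer-based world model) and the object features come from a foundation vision model rather than pixel pooling; this sketch only illustrates how the object-centric features augment the observation stream.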