Objects matter: object-centric world models improve reinforcement learning in visually complex environments

📅 2025-01-27
📈 Citations: 0
Influential: 0

🤖 AI Summary
Model-based reinforcement learning (MBRL) remains sample-inefficient in visually complex environments because conventional pixel-level world models, typically trained with a reconstruction loss dominated by large image regions, fail to capture small, dynamic, decision-critical elements. To address this, we propose OC-STORM, an object-centric MBRL pipeline built on the STORM algorithm. Key objects related to rewards and goals are annotated with segmentation masks; a pre-trained, frozen vision foundation model extracts object features; and these features are combined with the raw observations to predict environment dynamics. The policy is then trained on imagined trajectories generated by this object-centric world model. Experiments on Atari games and the visually complex game *Hollow Knight* indicate that OC-STORM overcomes limitations of conventional pixel-based MBRL approaches.
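The object-feature step summarized above can be illustrated with a small sketch. It is not the OC-STORM implementation: the function names `masked_average_pool` and `build_world_model_input`, the use of masked average pooling, and the plain concatenation with a pixel latent are all assumptions chosen for illustration, and the actual feature extractor and fusion used by the authors may differ.

```python
def masked_average_pool(feature_map, mask):
    """Average the feature vectors of all pixels where mask == 1.

    feature_map: H x W x C nested lists (hypothetical frozen-backbone output).
    mask:        H x W nested lists of 0/1 (hypothetical segmentation mask).
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for feature_row, mask_row in zip(feature_map, mask):
        for vector, inside in zip(feature_row, mask_row):
            if inside:
                count += 1
                for k in range(channels):
                    total[k] += vector[k]
    if count == 0:
        # Object is not visible in this frame: return a zero vector.
        return [0.0] * channels
    return [t / count for t in total]


def build_world_model_input(pixel_latent, object_features):
    """Concatenate a pixel latent with per-object feature vectors (assumed fusion)."""
    combined = list(pixel_latent)
    for features in object_features:
        combined.extend(features)
    return combined


# Toy 2x2 feature map with 2 channels per pixel.
feature_map = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
# The object occupies the right-hand column.
object_mask = [
    [0, 1],
    [0, 1],
]
empty_mask = [
    [0, 0],
    [0, 0],
]

object_vector = masked_average_pool(feature_map, object_mask)
absent_vector = masked_average_pool(feature_map, empty_mask)
model_input = build_world_model_input([0.1, 0.2], [object_vector])
print(object_vector, absent_vector, model_input)
```

The point of the sketch is that the object vector depends only on pixels inside the mask, so a small object yields a representation of the same size as a large one before it reaches the dynamics model.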

📝 Abstract
Deep reinforcement learning has achieved remarkable success in learning control policies from pixels across a wide range of tasks, yet its application remains hindered by low sample efficiency, requiring significantly more environment interactions than humans to reach comparable performance. Model-based reinforcement learning (MBRL) offers a solution by leveraging learnt world models to generate simulated experience, thereby improving sample efficiency. However, in visually complex environments, small or dynamic elements can be critical for decision-making. Yet, traditional MBRL methods in pixel-based environments typically rely on auto-encoding with an $L_2$ loss, which is dominated by large areas and often fails to capture decision-relevant details. To address these limitations, we propose an object-centric MBRL pipeline, which integrates recent advances in computer vision to allow agents to focus on key decision-related elements. Our approach consists of four main steps: (1) annotating key objects related to rewards and goals with segmentation masks, (2) extracting object features using a pre-trained, frozen foundation vision model, (3) incorporating these object features with the raw observations to predict environmental dynamics, and (4) training the policy using imagined trajectories generated by this object-centric world model. Building on the efficient MBRL algorithm STORM, we call this pipeline OC-STORM. We demonstrate OC-STORM's practical value in overcoming the limitations of conventional MBRL approaches on both Atari games and the visually complex game Hollow Knight.
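The abstract's claim that an $L_2$ reconstruction loss is dominated by large areas can be checked with a toy calculation. The sketch below is illustrative only: the 64x64 frame, the 2x2 object, and the pixel intensities are made-up values, not data from the paper.

```python
def mse(frame_a, frame_b):
    """Mean squared error between two equally sized 2D frames (nested lists)."""
    num_pixels = len(frame_a) * len(frame_a[0])
    squared_error = sum(
        (a - b) ** 2
        for row_a, row_b in zip(frame_a, frame_b)
        for a, b in zip(row_a, row_b)
    )
    return squared_error / num_pixels


HEIGHT, WIDTH = 64, 64
OBJECT_PIXELS = [(r, c) for r in range(10, 12) for c in range(20, 22)]  # 2x2 object

# Ground truth: uniform grey background with one small bright object.
target = [[0.5] * WIDTH for _ in range(HEIGHT)]
for r, c in OBJECT_PIXELS:
    target[r][c] = 1.0

# Reconstruction A: background is perfect, but the object is missing entirely.
recon_missing_object = [[0.5] * WIDTH for _ in range(HEIGHT)]

# Reconstruction B: object is perfect, but the background is off by 0.02.
recon_background_off = [[0.52] * WIDTH for _ in range(HEIGHT)]
for r, c in OBJECT_PIXELS:
    recon_background_off[r][c] = 1.0

loss_missing_object = mse(target, recon_missing_object)
loss_background_off = mse(target, recon_background_off)
print(loss_missing_object, loss_background_off)
```

In this toy setting, erasing the object costs roughly 2.4e-4, while a barely visible background shift costs roughly 4.0e-4, so an auto-encoder minimizing this loss is rewarded more for polishing the background than for keeping the object.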
Problem

Research questions and friction points this paper is trying to address.

Complex Visual Information
Model-based Reinforcement Learning
Learning Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-Centric Model
Reinforcement Learning
Visual Attention
Weipu Zhang
University of Edinburgh
Adam Jelley
University of Edinburgh
machine learning, reinforcement learning, representation learning
Trevor A. McInroe
University of Edinburgh
A. Storkey
University of Edinburgh