🤖 AI Summary
This work proposes a simulator-free reinforcement learning framework that trains policies directly within a diffusion-based world model constructed from real robot interaction data. Traditional physics simulators struggle to accurately model contact dynamics, non-rigid body interactions, and visual perception, while existing world model approaches often incur prohibitive computational costs for policy gradient optimization. To address these challenges, the method couples a global high-fidelity diffusion model with a local lightweight latent dynamics proxy and introduces a decoupled first-order gradient (FoG) technique, enabling high-fidelity trajectory rollouts with efficient gradient propagation. Evaluated on the Push-T task, the approach demonstrates significantly improved sample efficiency over PPO, and its effectiveness is further validated on a quadruped robot performing first-person object manipulation, highlighting the potential of data-driven world models for complex reinforcement learning tasks.
📝 Abstract
World models offer a promising avenue for more faithfully capturing complex dynamics, including contacts and non-rigidity, as well as complex sensory information, such as visual perception, in situations where standard simulators struggle. However, these models are computationally complex to evaluate, posing a challenge for popular RL approaches that have been used successfully with simulators to solve complex locomotion tasks, yet still struggle with manipulation. This paper introduces a method that bypasses simulators entirely, training RL policies inside world models learned from robots' interactions with real environments. At its core, our approach enables policy training with large-scale diffusion models via a novel decoupled first-order gradient (FoG) method: a full-scale world model generates accurate forward trajectories, while a lightweight latent-space surrogate approximates its local dynamics for efficient gradient computation. This coupling of a local and global world model ensures high-fidelity unrolling alongside computationally tractable differentiation. We demonstrate the efficacy of our method on the Push-T manipulation task, where it significantly outperforms PPO in sample efficiency. We further evaluate our approach through an ego-centric object manipulation task with a quadruped. Together, these results demonstrate that learning inside data-driven world models is a promising pathway for solving hard-to-model RL tasks in image space without reliance on hand-crafted physics simulators.
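To make the decoupled first-order gradient idea concrete, here is a minimal toy sketch (not the authors' code): forward rollouts use an "accurate" global model, while gradients are backpropagated only through a cheap local surrogate's Jacobians. All dynamics, names, and hyperparameters below are illustrative stand-ins; the paper's actual global model is a large diffusion model and the surrogate operates in latent space.

```python
import numpy as np

def f_global(s, a):
    # Stand-in for the expensive high-fidelity world model
    # (here: simple nonlinear toy dynamics).
    return s + 0.1 * np.tanh(a)

def local_jacobians(s, a):
    # Lightweight local surrogate: linearized dynamics around (s, a),
    # giving d s' / d s = A and d s' / d a = B.
    A = np.eye(len(s))                        # d f / d s of the toy dynamics
    B = 0.1 * np.diag(1.0 - np.tanh(a) ** 2)  # d f / d a of the tanh term
    return A, B

def rollout_and_grad(s0, actions, goal):
    # Forward pass: roll out with the full (accurate) model.
    states = [s0]
    for a in actions:
        states.append(f_global(states[-1], a))
    # Terminal cost and its gradient w.r.t. the final state.
    err = states[-1] - goal
    cost = 0.5 * float(err @ err)
    g_s = err
    # Backward pass: chain gradients through the surrogate Jacobians only,
    # evaluated along the high-fidelity trajectory.
    g_actions = []
    for t in reversed(range(len(actions))):
        A, B = local_jacobians(states[t], actions[t])
        g_actions.append(B.T @ g_s)   # d cost / d a_t
        g_s = A.T @ g_s               # d cost / d s_t
    return cost, list(reversed(g_actions)), states

# Tiny first-order optimization of an open-loop action sequence.
s0 = np.zeros(2)
goal = np.array([0.3, -0.2])
actions = [np.zeros(2) for _ in range(5)]
for _ in range(200):
    cost, grads, _ = rollout_and_grad(s0, actions, goal)
    actions = [a - 0.5 * g for a, g in zip(actions, grads)]
```

The key property mirrored here is the decoupling: trajectory fidelity comes from `f_global`, while differentiation never touches it, so gradient cost scales with the cheap surrogate rather than the full model.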