RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL

📅 2025-12-03

📈 Citations: 0

✨ Influential: 0

career value

219K/year

🤖 AI Summary

Weak generalization of embodied policies across diverse scenarios remains a core challenge: imitation learning tends to overfit expert trajectories, while reinforcement learning lacks task-agnostic reward signals. This paper proposes a world-model-driven intrinsic reward mechanism that jointly models reward generation and state-transition dynamics—constituting the first unified observation-reward joint predictive world model. Training is fully end-to-end via self-supervised state prediction, eliminating the need for external reward annotations. Evaluated on out-of-distribution scenarios, our method achieves an average performance gain of 37.5%, significantly improving cross-task transferability. Moreover, it establishes a novel paradigm for online world model training, enabling adaptive policy learning without handcrafted rewards or expert demonstrations.

Technology Category

Application Category

📝 Abstract

Achieving generalizable embodied policies remains a key challenge. Traditional policy learning paradigms, including both Imitation Learning (IL) and Reinforcement Learning (RL), struggle to cultivate generalizability across diverse scenarios. While IL policies often overfit to specific expert trajectories, RL suffers from the inherent lack of a unified and general reward signal necessary for effective multi-scene generalization. We posit that the world model is uniquely capable of serving as a universal environment proxy to address this limitation. However, current world models primarily focus on their ability to predict observations and still rely on task-specific, handcrafted reward functions, thereby failing to provide a truly general training environment. Toward this problem, we propose RoboScape-R, a framework leveraging the world model to serve as a versatile, general-purpose proxy for the embodied environment within the RL paradigm. We introduce a novel world model-based general reward mechanism that generates''endogenous''rewards derived from the model's intrinsic understanding of real-world state transition dynamics. Extensive experiments demonstrate that RoboScape-R effectively addresses the limitations of traditional RL methods by providing an efficient and general training environment that substantially enhances the generalization capability of embodied policies. Our approach offers critical insights into utilizing the world model as an online training strategy and achieves an average 37.5% performance improvement over baselines under out-of-domain scenarios.

Problem

Research questions and friction points this paper is trying to address.

Develops a generalizable robotics training framework via RL

Introduces world model-based endogenous rewards for multi-scene generalization

Addresses overfitting in IL and reward scarcity in traditional RL

Innovation

Methods, ideas, or system contributions that make the work stand out.

World model as universal environment proxy for RL

Endogenous rewards from model's state transition dynamics

Enhancing generalization with efficient general training environment

🔎 Similar Papers

Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics