RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL

📅 2025-12-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Weak generalization of embodied policies across diverse scenarios remains a core challenge: imitation learning tends to overfit expert trajectories, while reinforcement learning lacks task-agnostic reward signals. This paper proposes a world-model-driven intrinsic reward mechanism that jointly models reward generation and state-transition dynamics—constituting the first unified observation-reward joint predictive world model. Training is fully end-to-end via self-supervised state prediction, eliminating the need for external reward annotations. Evaluated on out-of-distribution scenarios, our method achieves an average performance gain of 37.5%, significantly improving cross-task transferability. Moreover, it establishes a novel paradigm for online world model training, enabling adaptive policy learning without handcrafted rewards or expert demonstrations.

Technology Category

Application Category

📝 Abstract
Achieving generalizable embodied policies remains a key challenge. Traditional policy learning paradigms, including both Imitation Learning (IL) and Reinforcement Learning (RL), struggle to cultivate generalizability across diverse scenarios. While IL policies often overfit to specific expert trajectories, RL suffers from the inherent lack of a unified and general reward signal necessary for effective multi-scene generalization. We posit that the world model is uniquely capable of serving as a universal environment proxy to address this limitation. However, current world models primarily focus on their ability to predict observations and still rely on task-specific, handcrafted reward functions, thereby failing to provide a truly general training environment. Toward this problem, we propose RoboScape-R, a framework leveraging the world model to serve as a versatile, general-purpose proxy for the embodied environment within the RL paradigm. We introduce a novel world model-based general reward mechanism that generates''endogenous''rewards derived from the model's intrinsic understanding of real-world state transition dynamics. Extensive experiments demonstrate that RoboScape-R effectively addresses the limitations of traditional RL methods by providing an efficient and general training environment that substantially enhances the generalization capability of embodied policies. Our approach offers critical insights into utilizing the world model as an online training strategy and achieves an average 37.5% performance improvement over baselines under out-of-domain scenarios.
Problem

Research questions and friction points this paper is trying to address.

Develops a generalizable robotics training framework via RL
Introduces world model-based endogenous rewards for multi-scene generalization
Addresses overfitting in IL and reward scarcity in traditional RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

World model as universal environment proxy for RL
Endogenous rewards from model's state transition dynamics
Enhancing generalization with efficient general training environment
🔎 Similar Papers
No similar papers found.
Yinzhou Tang
Yinzhou Tang
Tsinghua University
Yu Shang
Yu Shang
Department of Electronic Engineering, Tsinghua University
Multimodal LearningLLM AgentRecommender System
Y
Yinuo Chen
Tsinghua University
B
Bingwen Wei
Tsinghua University
X
Xin Zhang
Manifold AI
S
Shu'ang Yu
Tsinghua University
L
Liangzhi Shi
Tsinghua University
C
Chao Yu
Tsinghua University
C
Chen Gao
Tsinghua University
W
Wei Wu
Manifold AI
Y
Yong Li
Tsinghua University