On-Robot Reinforcement Learning with Goal-Contrastive Rewards

📅 2024-10-25
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
In real-world robotic reinforcement learning, sparse rewards hinder efficient exploration, while manually designing dense reward functions incurs high engineering costs. To address this, the authors propose Goal-Contrastive Rewards (GCR), a dense reward learning framework that requires neither hand-designed rewards nor action labels. GCR is trained on passive video data alone, jointly optimizing two objectives: an implicit value loss, which models how reward should increase along a successful trajectory, and a goal-contrastive loss, which discriminates successful from failed trajectories. Because no action labels are needed, GCR also supports cross-embodiment transfer (e.g., from videos of humans or other robots to the agent's own policy), substantially improving data scalability. Evaluations in simulated manipulation environments and on real-world Franka and Spot platforms show that GCR makes model-free RL more sample-efficient, enabling it to solve about twice as many tasks as baseline reward-learning methods, and demonstrate positive cross-embodiment transfer from third-person video demonstrations.

📝 Abstract
Reinforcement Learning (RL) has the potential to enable robots to learn from their own actions in the real world. Unfortunately, RL can be prohibitively expensive, in terms of on-robot runtime, due to inefficient exploration when learning from a sparse reward signal. Designing dense reward functions is labour-intensive and requires domain expertise. In our work, we propose GCR (Goal-Contrastive Rewards), a dense reward function learning method that can be trained on passive video demonstrations. By using videos without actions, our method is easier to scale, as we can use arbitrary videos. GCR combines two loss functions, an implicit value loss function that models how the reward increases when traversing a successful trajectory, and a goal-contrastive loss that discriminates between successful and failed trajectories. We perform experiments in simulated manipulation environments across RoboMimic and MimicGen tasks, as well as in the real world using a Franka arm and a Spot quadruped. We find that GCR leads to more sample-efficient RL, enabling model-free RL to solve about twice as many tasks as our baseline reward learning methods. We also demonstrate positive cross-embodiment transfer from videos of people and of other robots performing a task. Website: https://gcr-robot.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Efficient on-robot reinforcement learning exploration
Dense reward function design without domain expertise
Cross-embodiment transfer from passive video demonstrations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Goal-Contrastive Rewards for dense learning
Combines implicit value and contrastive loss
Uses passive video demonstrations for scaling
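The two losses described above can be sketched in a toy form. Everything below (the scalar "frame embeddings", the distance-based reward model, the hinge and logistic loss shapes, and all function names and hyperparameters) is an illustrative assumption for intuition only, not the paper's actual architecture:

```python
import numpy as np

def reward(frame_emb, goal_emb):
    # Toy reward model: negative distance between a frame embedding
    # and the goal embedding (higher reward = closer to the goal).
    return -np.abs(frame_emb - goal_emb)

def implicit_value_loss(traj_embs, goal_emb, margin=0.0):
    # Implicit value loss (sketch): along a *successful* trajectory,
    # the reward should increase toward the goal. Penalise any step
    # where the reward fails to rise by at least `margin`.
    rs = np.array([reward(e, goal_emb) for e in traj_embs])
    diffs = rs[1:] - rs[:-1]
    return float(np.maximum(0.0, margin - diffs).mean())

def goal_contrastive_loss(success_final, fail_final, goal_emb):
    # Goal-contrastive loss (sketch): the final frame of a successful
    # trajectory should score higher reward than the final frame of a
    # failed one; logistic loss on the reward gap.
    gap = reward(success_final, goal_emb) - reward(fail_final, goal_emb)
    return float(np.log1p(np.exp(-gap)))

def gcr_loss(success_traj, fail_final, goal_emb, alpha=1.0):
    # Joint objective: value shaping on successful trajectories plus
    # contrastive discrimination against failed outcomes.
    return implicit_value_loss(success_traj, goal_emb) + \
        alpha * goal_contrastive_loss(success_traj[-1], fail_final, goal_emb)

# Example: a trajectory that steadily approaches goal embedding 1.0
# incurs zero implicit value loss; the contrastive term still pushes
# the failed final frame (0.2) below the successful one.
traj = np.array([0.0, 0.5, 0.9, 1.0])
print(gcr_loss(traj, fail_final=0.2, goal_emb=1.0))
```

In the actual method these losses are applied to learned video representations rather than scalars, and the reward model is trained jointly with the RL policy; the sketch only shows how the two terms interact.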