RoboReward: General-Purpose Vision-Language Reward Models for Robotics

📅 2026-01-02
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
This work addresses the scarcity of efficient, general, and automated reward signals for real-world robotic tasks, where existing approaches often rely on labor-intensive human annotation or brittle handcrafted objectives. The authors introduce RoboReward, the first large-scale reward-evaluation benchmark built from real robot data, and propose a negative-example augmentation pipeline that combines counterfactual relabeling with temporal clipping to generate negatives and near-miss examples. Using this data, they train specialized vision-language reward models at 4B and 8B parameters that significantly outperform much larger general-purpose vision-language models on short-horizon robotic tasks. Notably, the 8B variant surpasses Gemini Robotics-ER 1.5 by a large margin in real-world reinforcement learning and approaches the performance of human-designed rewards.
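
The two augmentation operations lend themselves to a short sketch. Below is a minimal illustration, assuming a simple episode record: counterfactual relabeling pairs a successful video with a different instruction (yielding a guaranteed negative), and temporal clipping truncates a successful episode mid-task (yielding a partial-progress near-miss). The `Episode` container and its fields are hypothetical stand-ins, not the paper's released data format.

```python
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Episode:
    frames: tuple        # video frames of the rollout
    instruction: str     # natural-language task description
    reward: float        # 1.0 = success, 0.0 = failure (illustrative scale)

def counterfactual_relabel(ep: Episode, instruction_pool: list) -> Episode:
    """Pair a successful video with a different task instruction -> negative."""
    distractor = random.choice(
        [t for t in instruction_pool if t != ep.instruction])
    return replace(ep, instruction=distractor, reward=0.0)

def temporal_clip(ep: Episode, keep_frac: float = 0.5) -> Episode:
    """Truncate a successful episode mid-task -> partial-progress near-miss."""
    cut = max(1, int(len(ep.frames) * keep_frac))
    return replace(ep, frames=ep.frames[:cut], reward=0.0)
```

Both operations reuse the same success-heavy videos, which is why no additional data collection is needed to obtain failure examples.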

📝 Abstract
A well-designed reward is critical for effective reinforcement learning-based policy improvement. In real-world robotics, obtaining such rewards typically requires either labor-intensive human labeling or brittle, handcrafted objectives. Vision-language models (VLMs) have shown promise as automatic reward models, yet their effectiveness on real robot tasks is poorly understood. In this work, we aim to close this gap by introducing (1) RoboReward, a robotics reward dataset and benchmark built on large-scale real-robot corpora from Open X-Embodiment (OXE) and RoboArena, and (2) vision-language reward models trained on this dataset (RoboReward 4B/8B). Because OXE is success-heavy and lacks failure examples, we propose a negative-example data augmentation pipeline that generates calibrated negatives and near-misses via counterfactual relabeling of successful episodes and temporal clipping to create partial-progress outcomes from the same videos. Using this framework, we build a large training and evaluation dataset spanning diverse tasks and embodiments to test whether state-of-the-art VLMs can reliably provide rewards for robot learning. Our evaluation of open and proprietary VLMs finds that no model excels across tasks, highlighting substantial room for improvement. We then train general-purpose 4B- and 8B-parameter models that outperform much larger VLMs in assigning rewards for short-horizon robotic tasks. Finally, we deploy the 8B model in real-robot reinforcement learning and find that it improves policy learning over Gemini Robotics-ER 1.5 while narrowing the gap to RL training with human-provided rewards. We release the full dataset, trained reward models, and evaluation suite on our project website to advance the development of general-purpose reward models in robotics: https://crfm.stanford.edu/helm/robo-reward-bench
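
For intuition on how such a reward model slots into policy improvement, here is a minimal sketch of replacing hand-designed rewards with model-predicted scores. The `VideoRewardModel` protocol and its `score` method are assumed placeholders for illustration, not the released RoboReward API.

```python
from typing import Protocol, Sequence, Tuple

class VideoRewardModel(Protocol):
    """Assumed interface: score a rollout video against an instruction."""
    def score(self, frames: Sequence, instruction: str) -> float: ...

def label_rollouts(rollouts: Sequence[Tuple[Sequence, str]],
                   reward_model: VideoRewardModel) -> list:
    """Attach model-predicted rewards to (frames, instruction) rollouts,
    so any reward-conditioned policy update can consume them directly."""
    return [(frames, instr, reward_model.score(frames, instr))
            for frames, instr in rollouts]
```

The key design point is that the reward model only sees the rollout video and the task instruction, so the same model can supervise RL across tasks and embodiments without per-task reward engineering.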
Problem

Research questions and friction points this paper is trying to address.

robotics
reward modeling
vision-language models
reinforcement learning
real-world robot tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language reward models
negative example augmentation
counterfactual relabeling
real-robot reinforcement learning
RoboReward