Large Reward Models: Generalizable Online Robot Reward Generation with Vision-Language Models

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the difficulty of designing generalizable reward functions for reinforcement learning in robotic manipulation. The authors propose an online, multidimensional reward generation mechanism driven by a vision-language model (VLM), which eliminates the need for handcrafted rewards. The method derives real-time signals (capturing progress, task completion, and temporal consistency) directly from visual observations and uses them to iteratively refine an initial imitation learning policy in a closed loop. Notably, this approach achieves zero-shot, online, multidimensional VLM-based reward shaping without any task-specific reward engineering. Within only 30 reinforcement learning iterations, it substantially improves policy success rates, demonstrating strong sample efficiency and cross-task generalization.

📝 Abstract
Reinforcement Learning (RL) has shown great potential in refining robotic manipulation policies, yet its efficacy remains strongly bottlenecked by the difficulty of designing generalizable reward functions. In this paper, we propose a framework for online policy refinement that adapts foundation VLMs into online reward generators. We develop a robust, scalable reward model based on a state-of-the-art VLM, trained on a large-scale, multi-source dataset encompassing real-world robot trajectories, human-object interactions, and diverse simulated environments. Unlike prior approaches that evaluate entire trajectories post-hoc, our method leverages the VLM to formulate a multifaceted reward signal comprising process, completion, and temporal contrastive rewards based on current visual observations. Starting from a base policy trained via Imitation Learning (IL), we employ these VLM rewards to guide the model to correct sub-optimal behaviors in a closed-loop manner. We evaluate our framework on challenging long-horizon manipulation benchmarks requiring sequential execution and precise control. Crucially, our reward model operates in a purely zero-shot manner within these test environments. Experimental results demonstrate that our method significantly improves the success rate of the initial IL policy within just 30 RL iterations, indicating remarkable sample efficiency. This empirical evidence highlights that VLM-generated signals can provide reliable feedback to resolve execution errors, effectively eliminating the need for manual reward engineering and facilitating efficient online refinement for robot learning.
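The abstract describes a reward signal composed of process, completion, and temporal contrastive terms scored from visual observations. A minimal sketch of how such signals might be combined in a closed-loop refinement step is below; the `VLMReward` container, `score_with_vlm` callback, and the weights are illustrative assumptions, not details from the paper.

```python
# Hedged sketch: combining multidimensional VLM reward signals.
# All names and weights here are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class VLMReward:
    process: float      # progress toward the goal, in [0, 1]
    completion: float   # 1.0 if the VLM judges the task complete
    temporal: float     # contrastive score vs. the previous observation


def combined_reward(r: VLMReward,
                    w_process: float = 1.0,
                    w_completion: float = 5.0,
                    w_temporal: float = 0.5) -> float:
    """Weighted sum of the three VLM-derived signals (weights are
    illustrative, not values from the paper)."""
    return (w_process * r.process
            + w_completion * r.completion
            + w_temporal * r.temporal)


def refinement_step(policy, obs, score_with_vlm):
    """One closed-loop step: the VLM scores the current observation,
    so no handcrafted reward function is needed."""
    action = policy(obs)
    r = score_with_vlm(obs)   # a real system would query the VLM here
    return action, combined_reward(r)
```

Separating the per-dimension scores from their weighting keeps the VLM query independent of how the RL algorithm shapes the scalar reward.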
Problem

Research questions and friction points this paper is trying to address.

reward function
generalizability
robotic manipulation
reinforcement learning
online reward generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Online Reward Generation
Zero-Shot Generalization
Reinforcement Learning
Robot Manipulation
👥 Authors
Yanru Wu — USC Physical Superintelligence Lab
Weiduo Yuan — Master Student, USC (Robot Learning, VLA)
Ang Qi — USC Physical Superintelligence Lab
Vitor Guizilini — Toyota Research Institute
Jiageng Mao — University of Southern California (Robotics, Computer Vision)
Yue Wang — USC (Computer Vision, Robotics)