Real-World Offline Reinforcement Learning from Vision Language Model Feedback

📅 2024-11-08
🏛️ arXiv.org
📈 Citations: 6
Influential: 1
🤖 AI Summary
To address the challenges of offline reinforcement learning (RL) in real-world robotic settings—specifically, the absence of human-provided reward annotations and the difficulty of obtaining precise state labels—this paper proposes an end-to-end framework that requires neither expert demonstrations nor manual labeling. The method leverages vision-language models (VLMs) to generate preference-based reward labels directly from task text descriptions and suboptimal visual trajectories, then integrates these labels with implicit Q-learning (IQL) for unsupervised reward modeling and policy optimization. To the authors' knowledge, this is the first work to systematically incorporate VLMs into offline RL for automated reward annotation. The approach is evaluated on a real-world robot-assisted dressing task and on simulated rigid- and soft-body manipulation benchmarks. Results demonstrate substantial improvements over behavioral cloning and inverse RL baselines, validating the effectiveness and practicality of the framework for high-quality policy learning in complex embodied tasks.

📝 Abstract
Offline reinforcement learning can enable policy learning from pre-collected, sub-optimal datasets without online interactions. This makes it ideal for real-world robots and safety-critical scenarios, where collecting online data or expert demonstrations is slow, costly, and risky. However, most existing offline RL works assume the dataset is already labeled with the task rewards, a process that often requires significant human effort, especially when ground-truth states are hard to ascertain (e.g., in the real world). In this paper, we build on prior work, specifically RL-VLM-F, and propose a novel system that automatically generates reward labels for offline datasets using preference feedback from a vision-language model and a text description of the task. Our method then learns a policy using offline RL with the reward-labeled dataset. We demonstrate the system's applicability to a complex real-world robot-assisted dressing task, where we first learn a reward function using a vision-language model on a sub-optimal offline dataset, and then use the learned reward with Implicit Q-learning to develop an effective dressing policy. Our method also performs well in simulation tasks involving the manipulation of rigid and deformable objects, and significantly outperforms baselines such as behavior cloning and inverse RL. In summary, we propose a new system that enables automatic reward labeling and policy learning from unlabeled, sub-optimal offline datasets.
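The reward-labeling step described above can be sketched as preference-based reward learning under a Bradley-Terry model: a VLM is shown pairs of trajectory snapshots plus the task description and asked which looks closer to task success, and a reward model is fit so that preferred segments score higher. The sketch below is a minimal, self-contained illustration with synthetic features; `vlm_preference` is a hypothetical stand-in for the actual VLM query (a hidden linear score plays the role of the VLM's judgment), and the linear reward model, learning rate, and step counts are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def vlm_preference(feat_a, feat_b):
    # Hypothetical stand-in for querying a VLM with two trajectory
    # images and a task text description. Here a hidden linear score
    # decides which segment "looks better"; returns 0 if a is
    # preferred, 1 if b is preferred.
    w_true = np.array([1.0, -0.5, 2.0])
    return 0 if feat_a @ w_true > feat_b @ w_true else 1

# Synthetic trajectory-segment features (e.g. pooled image embeddings).
feats = rng.normal(size=(200, 3))
pairs = rng.integers(0, len(feats), size=(500, 2))
labels = np.array([vlm_preference(feats[i], feats[j]) for i, j in pairs])

# Bradley-Terry reward model: P(a preferred over b) = sigmoid(r(a) - r(b)).
# Fit a linear reward r(x) = w @ x by gradient ascent on the log-likelihood.
w = np.zeros(3)
lr = 0.1
for _ in range(300):
    ra = feats[pairs[:, 0]] @ w
    rb = feats[pairs[:, 1]] @ w
    p_a = 1.0 / (1.0 + np.exp(-(ra - rb)))    # predicted P(a preferred)
    err = (1 - labels) - p_a                   # label 0 means "a preferred"
    grad = ((feats[pairs[:, 0]] - feats[pairs[:, 1]]) * err[:, None]).mean(axis=0)
    w += lr * grad

# Reward labels for the offline dataset, ready for offline RL.
rewards = feats @ w
acc = np.mean(
    (feats[pairs[:, 0]] @ w > feats[pairs[:, 1]] @ w).astype(int) == (labels == 0)
)
print(f"preference accuracy: {acc:.2f}")
```

Once every transition in the offline dataset carries a learned reward, any standard offline RL algorithm (IQL in the paper) can consume it unchanged.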
Problem

Research questions and friction points this paper is trying to address.

Offline RL datasets typically lack task reward labels
Manual reward annotation is costly, especially when ground-truth states are hard to ascertain
Real-world robot data is sub-optimal, and online collection is slow and risky
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automates reward labeling using vision-language model
Learns policies from unlabeled offline datasets
Applies Implicit Q-learning (IQL) for policy development
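The IQL step mentioned above trains a value function with expectile regression: an asymmetric squared loss that weights underestimates more than overestimates, so the value estimate is pushed toward an upper expectile of the Q-values over dataset actions without ever querying out-of-distribution actions. The sketch below is a toy illustration of that loss only (the expectile parameter tau=0.7 and the scalar fitting loop are illustrative assumptions, not the paper's full IQL training setup).

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    # Asymmetric L2 loss used in IQL's value update:
    # underestimates (diff > 0) are weighted by tau,
    # overestimates by (1 - tau), biasing the fit toward
    # an upper expectile of the targets.
    weight = np.where(diff > 0, tau, 1 - tau)
    return (weight * diff ** 2).mean()

# Toy check: fit a scalar v to targets q by gradient descent.
# With tau = 0.7, v settles above the mean (1.5), at the
# 0.7-expectile of the targets.
q = np.array([0.0, 1.0, 2.0, 3.0])
v = 0.0
for _ in range(2000):
    diff = q - v
    grad = -2 * np.where(diff > 0, 0.7, 0.3) * diff
    v -= 0.05 * grad.mean()
print(round(v, 2))  # 1.9, the tau=0.7 expectile of q
```

Raising tau toward 1 makes the value estimate approach the maximum Q-value seen in the data, which is what lets IQL improve on sub-optimal behavior without out-of-distribution action queries.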