Policy Learning from Large Vision-Language Model Feedback without Reward Modeling

📅 2025-07-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Offline reinforcement learning typically relies on hand-crafted reward functions, which are costly to design and require domain expertise. To address this, we propose a vision-language model (VLM)-guided policy learning framework that eliminates explicit reward modeling: a large VLM generates natural language descriptions of visual trajectory segments and infers human-aligned preference labels directly from these descriptions; the policy network is then optimized end-to-end with a supervised contrastive learning objective. This work presents the first offline RL approach that trains policies solely from VLM-derived preference feedback, completely bypassing reward function engineering. On the MetaWorld benchmark, our method matches or exceeds state-of-the-art VLM-based reward modeling methods. Furthermore, we validate its generalizability and practical utility on real-world robotic manipulation tasks.
📝 Abstract
Offline reinforcement learning (RL) provides a powerful framework for training robotic agents on pre-collected, suboptimal datasets, eliminating the need for costly, time-consuming, and potentially hazardous online interactions. This is particularly useful in safety-critical real-world applications, where online data collection is expensive and impractical. However, existing offline RL algorithms typically require reward-labeled data, which introduces an additional bottleneck: reward function design is itself costly, labor-intensive, and requires significant domain expertise. In this paper, we introduce PLARE, a novel approach that leverages large vision-language models (VLMs) to provide guidance signals for agent training. Instead of relying on manually designed reward functions, PLARE queries a VLM for preference labels on pairs of visual trajectory segments based on a language task description. The policy is then trained directly from these preference labels using a supervised contrastive preference learning objective, bypassing the need to learn explicit reward models. Through extensive experiments on robotic manipulation tasks from the MetaWorld benchmark, PLARE achieves performance on par with or surpassing existing state-of-the-art VLM-based reward generation methods. Furthermore, we demonstrate the effectiveness of PLARE in real-world manipulation tasks with a physical robot, further validating its practical applicability.
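The abstract's "supervised contrastive preference learning objective" can be sketched as a Bradley-Terry-style logistic loss over discounted sums of policy log-probabilities, in the spirit of contrastive preference learning. This is a minimal NumPy sketch, not the paper's implementation; the temperature `alpha` and discount `gamma` are assumed hyperparameters.

```python
import numpy as np

def contrastive_preference_loss(logp_pref, logp_dispref, alpha=0.1, gamma=0.99):
    """Contrastive loss over a batch of VLM-labeled segment pairs.

    logp_pref, logp_dispref: arrays of shape (B, T) holding per-step
    log pi(a_t | s_t) under the current policy for the preferred and
    dispreferred segments of each pair.
    """
    T = logp_pref.shape[1]
    discounts = gamma ** np.arange(T)
    # Discounted log-probability sums act as implicit segment scores.
    s_pref = alpha * (logp_pref * discounts).sum(axis=1)
    s_dis = alpha * (logp_dispref * discounts).sum(axis=1)
    # Stable -log sigmoid(s_pref - s_dis): push the policy to assign
    # higher likelihood to actions in the VLM-preferred segment.
    return np.mean(np.logaddexp(0.0, -(s_pref - s_dis)))
```

Because the loss depends only on the policy's action log-probabilities, no reward model is ever fit, which is the key simplification the paper claims.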
Problem

Research questions and friction points this paper is trying to address.

Eliminate need for manual reward design in offline RL
Use VLM feedback for policy training without rewards
Improve robotic manipulation performance without online data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages vision-language models for feedback
Uses preference labels instead of rewards
Trains policy with contrastive learning objective
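The pipeline implied by the bullets above (query a VLM for preference labels, then train on them) can be sketched as a label-collection loop. The `query_vlm_preference` function is a hypothetical stand-in for the actual VLM call, which is not specified here; it is stubbed with a random choice so the sketch runs.

```python
import random

def query_vlm_preference(seg_a, seg_b, task_desc):
    """Hypothetical stand-in for the VLM preference query.

    In the described system this would render both trajectory segments
    to images, prompt a large VLM with the language task description,
    and parse its answer into a label: 0 (prefer seg_a) or 1 (prefer
    seg_b). Stubbed with a coin flip so the loop is runnable.
    """
    return random.randint(0, 1)

def collect_preference_dataset(segments, task_desc, n_pairs=100, seed=0):
    """Sample segment pairs and label each with the (stubbed) VLM."""
    random.seed(seed)
    dataset = []
    for _ in range(n_pairs):
        seg_a, seg_b = random.sample(segments, 2)
        label = query_vlm_preference(seg_a, seg_b, task_desc)
        # Store (preferred, dispreferred) so a contrastive policy
        # objective can consume the pairs directly.
        pref, dis = (seg_a, seg_b) if label == 0 else (seg_b, seg_a)
        dataset.append((pref, dis))
    return dataset
```

The resulting `(preferred, dispreferred)` pairs feed straight into the contrastive objective, with no intermediate reward-model fitting step.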