🤖 AI Summary
To address the high cost and poor scalability of human feedback in preference-based reinforcement learning, this paper proposes PrefVLM, the first framework to integrate vision-language models (VLMs) into preference learning. It introduces an uncertainty-aware automated preference labeling mechanism, coupled with selective human annotation so that human effort is spent only on cases where the VLM is unsure. Furthermore, a self-supervised inverse dynamics loss is designed for task-adaptive fine-tuning of the VLM, enabling cross-task knowledge transfer. Evaluated on Meta-World manipulation tasks, PrefVLM matches or surpasses state-of-the-art performance while reducing human annotation effort by up to 50%. The fine-tuned VLM significantly improves feedback efficiency and generalization on unseen tasks. Overall, PrefVLM establishes a paradigm for low-feedback, robust preference learning, bridging scalable automation with targeted human oversight.
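To make the labeling mechanism concrete, here is a minimal Python sketch of an uncertainty-gated loop of the kind the summary describes, assuming a VLM scorer that returns a preference probability for a pair of trajectory segments. The names `vlm_score`, `query_human`, and the threshold `tau` are illustrative stand-ins, not the paper's actual API.

```python
def label_preferences(segment_pairs, vlm_score, query_human, tau=0.8):
    """Label segment pairs with the VLM, deferring uncertain ones to a human.

    vlm_score(seg_a, seg_b) -> probability that seg_a is preferred, in [0, 1].
    query_human(seg_a, seg_b) -> 0.0 or 1.0 from a human annotator.
    tau -- confidence threshold above which the VLM label is trusted.
    """
    labels = []
    for seg_a, seg_b in segment_pairs:
        p = vlm_score(seg_a, seg_b)
        confidence = max(p, 1.0 - p)  # distance from the 0.5 decision boundary
        if confidence >= tau:
            labels.append(1.0 if p > 0.5 else 0.0)    # trust the VLM label
        else:
            labels.append(query_human(seg_a, seg_b))  # defer uncertain pairs
    return labels
```

Under this scheme, raising `tau` trades more human queries for higher label quality, which is the knob that lets automation and targeted oversight be balanced.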
📝 Abstract
Preference-based reinforcement learning (RL) offers a promising approach for aligning policies with human intent but is often constrained by the high cost of human feedback. In this work, we introduce PrefVLM, a framework that integrates Vision-Language Models (VLMs) with selective human feedback to significantly reduce annotation requirements while maintaining performance. Our method leverages VLMs to generate initial preference labels, which are then filtered to identify uncertain cases for targeted human annotation. Additionally, we adapt VLMs using a self-supervised inverse dynamics loss to improve alignment with evolving policies. Experiments on Meta-World manipulation tasks demonstrate that PrefVLM achieves comparable or superior success rates to state-of-the-art methods while using up to 2× fewer human annotations. Furthermore, we show that adapted VLMs enable efficient knowledge transfer across tasks, further minimizing feedback needs. Our results highlight the potential of combining VLMs with selective human supervision to make preference-based RL more scalable and practical.
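The inverse dynamics adaptation also lends itself to a compact illustration. Below is a minimal PyTorch sketch of a self-supervised inverse dynamics objective consistent with the abstract's description: an auxiliary head predicts the action taken between consecutive observations from the VLM's visual embeddings, and the regression loss fine-tunes the encoder without any preference labels. The `InverseDynamicsHead` module, the MLP sizes, and the MSE regression target are assumptions for illustration, not the paper's reported architecture.

```python
import torch
import torch.nn as nn

class InverseDynamicsHead(nn.Module):
    """Predicts the action a_t from embeddings of frames (o_t, o_{t+1})."""
    def __init__(self, embed_dim, action_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, z_t, z_next):
        # Concatenate consecutive-frame embeddings and regress the action.
        return self.mlp(torch.cat([z_t, z_next], dim=-1))

def inverse_dynamics_loss(vlm_encoder, head, obs_t, obs_next, actions):
    """Self-supervised loss; gradients flow into the VLM's visual encoder,
    adapting its features to the current task."""
    z_t, z_next = vlm_encoder(obs_t), vlm_encoder(obs_next)
    pred_actions = head(z_t, z_next)
    return nn.functional.mse_loss(pred_actions, actions)
```

Because the supervision signal (the executed action) comes for free from the agent's own rollouts, this objective can keep the VLM's features aligned with the evolving policy at no extra annotation cost.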