Revisiting the Learning Objectives of Vision-Language Reward Models

📅 2025-12-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the design of learning objectives for vision-language reward models (VLM-RMs) in embodied intelligence, aiming to isolate and quantify their true impact on the generalization of learned reward functions. Holding the backbone architecture, finetuning data, and evaluation framework fixed, the authors systematically ablate the learning objective, comparing multiple loss functions head-to-head for the first time. Experiments in the Meta-World simulation environment show that a simple triplet loss built on contrastive VLMs significantly outperforms more complex state-of-the-art methods, achieving higher reward accuracy and stronger correlation with expert progress. Crucially, the results indicate that prior performance gains stem largely from improvements in data or model architecture, not from innovations in the objective function itself. The study establishes a reproducible benchmark for evaluating learning objectives in VLM-RMs and demonstrates the surprising effectiveness of minimalist objective design.
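
The paper's exact formulation is not reproduced here, but a triplet objective over contrastive VLM embeddings typically takes the shape sketched below. The function name, the margin value, and the use of cosine similarity are illustrative assumptions rather than the authors' precise loss:

```python
import torch
import torch.nn.functional as F

def triplet_reward_loss(task_embed, pos_frame_embed, neg_frame_embed, margin=0.5):
    """Triplet loss over contrastive VLM embeddings (minimal sketch).

    task_embed:      (B, D) text embeddings of the task description
    pos_frame_embed: (B, D) embeddings of frames near task completion
    neg_frame_embed: (B, D) embeddings of frames far from completion
    """
    # Cosine similarity is the natural metric for CLIP-style embeddings.
    a = F.normalize(task_embed, dim=-1)
    p = F.normalize(pos_frame_embed, dim=-1)
    n = F.normalize(neg_frame_embed, dim=-1)
    sim_pos = (a * p).sum(dim=-1)  # similarity of instruction to "good" frame
    sim_neg = (a * n).sum(dim=-1)  # similarity of instruction to "bad" frame
    # Require the positive frame to sit at least `margin` closer to the
    # instruction than the negative frame.
    return F.relu(margin - sim_pos + sim_neg).mean()
```

The appeal of such an objective is precisely its minimalism: it only reorders similarities in the pretrained embedding space, without auxiliary heads or multi-term losses.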

📝 Abstract
Learning generalizable reward functions is a core challenge in embodied intelligence. Recent work leverages contrastive vision-language models (VLMs) to obtain dense, domain-agnostic rewards without human supervision. These methods adapt VLMs into reward models through increasingly complex learning objectives, yet meaningful comparison remains difficult due to differences in training data, architectures, and evaluation settings. In this work, we isolate the impact of the learning objective by evaluating recent VLM-based reward models under a unified framework with identical backbones, finetuning data, and evaluation environments. Using Meta-World tasks, we assess modeling accuracy by measuring consistency with ground-truth reward and correlation with expert progress. Remarkably, we show that a simple triplet loss outperforms state-of-the-art methods, suggesting that much of the improvement in recent approaches could be attributed to differences in data and architectures.
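
As a concrete reading of that evaluation protocol, one can score a reward model by how monotonically its per-frame rewards track progress along a successful expert trajectory. The use of Spearman rank correlation here is an assumption for illustration; the paper may use a different statistic:

```python
import numpy as np
from scipy.stats import spearmanr

def expert_progress_correlation(rewards):
    """Correlate per-frame rewards with normalized progress along an
    expert trajectory. `rewards` is a (T,) array with one value per
    frame of a successful demonstration."""
    progress = np.linspace(0.0, 1.0, num=len(rewards))  # 0 at start, 1 at success
    rho, _ = spearmanr(rewards, progress)
    return rho  # near 1.0 means rewards rise monotonically with expert progress
```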
Problem

Research questions and friction points this paper is trying to address.

Evaluating learning objectives for vision-language reward models
Comparing reward models under a unified framework
Assessing modeling accuracy via consistency with ground-truth rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework isolates learning objective impact
Simple triplet loss outperforms complex state-of-the-art methods
Evaluates reward-model consistency on Meta-World tasks (per-frame reward usage sketched below)
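
For context on how such a model is consumed downstream (an illustrative sketch, not the paper's code), the finetuned VLM is queried once per frame to yield a dense reward without human supervision:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dense_vlm_reward(frame_embed, task_embed):
    """Dense reward = cosine similarity between the current observation
    embedding and the language description of the task, in the style of
    CLIP-based VLM reward models."""
    f = F.normalize(frame_embed, dim=-1)
    t = F.normalize(task_embed, dim=-1)
    return (f * t).sum(dim=-1)  # one scalar reward per frame
```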