Rewarding DINO: Predicting Dense Rewards with Vision Foundation Models

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a language-conditioned visual reward modeling approach to address the challenge that conventional dense reward functions, which rely on privileged simulation state, are difficult to deploy on real robots. By combining the DINO vision foundation model with natural language instructions and a ranking loss, the method predicts dense, task-semantics-aware rewards directly from images, without requiring expert demonstrations. The resulting lightweight, general-purpose reward function can serve as a drop-in replacement for handcrafted analytical rewards. Evaluated on Meta-World+, the trained reward model achieves strong performance on seen tasks, generalizes to novel environments in both simulation and the real world, and integrates with off-the-shelf reinforcement learning algorithms to solve manipulation tasks.
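The summary describes the architecture only at a high level. As a rough illustration, a language-conditioned reward model of this kind can be sketched as a frozen DINO image encoder plus a frozen text encoder feeding a small trainable head. This is a minimal sketch: the layer sizes, the backbone variant, and the text-encoder choice are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LanguageConditionedReward(nn.Module):
    """Sketch: frozen DINO image features plus a frozen instruction
    embedding are mapped by a small MLP to a scalar dense reward.
    All dimensions here are illustrative assumptions."""

    def __init__(self, img_dim: int = 384, text_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, img_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # img_feat:  (B, img_dim)  features from a frozen DINO backbone
        # text_feat: (B, text_dim) embedding of the language instruction
        return self.head(torch.cat([img_feat, text_feat], dim=-1)).squeeze(-1)

# A frozen DINOv2 backbone can be loaded via torch.hub, e.g.
#   backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
# whose 384-dimensional CLS embedding would serve as img_feat.
```

Keeping the heavy encoders frozen and training only a small head is consistent with the summary's claim that the resulting reward function is lightweight.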

📝 Abstract
Well-designed dense reward functions in robot manipulation not only indicate whether a task is completed but also encode progress along the way. Generally, designing dense rewards is challenging and usually requires access to privileged state information available only in simulation, not in real-world experiments. This makes reward prediction models that infer task state information from camera images attractive. A common approach is to predict rewards from expert demonstrations based on visual similarity or sequential frame ordering. However, this biases the resulting reward function towards a specific solution and leaves it undefined in states not covered by the demonstrations. In this work, we introduce Rewarding DINO, a method for language-conditioned reward modeling that learns actual reward functions rather than specific trajectories. The model's compact size allows it to serve as a direct replacement for analytical reward functions with comparatively low computational overhead. We train our model on data sampled from 24 Meta-World+ tasks using a rank-based loss and evaluate pairwise accuracy, rank correlation, and calibration. Rewarding DINO achieves competitive performance on tasks from the training set and generalizes to new settings in simulation and the real world, indicating that it learns task semantics. We also test the model with off-the-shelf reinforcement learning algorithms to solve tasks from our Meta-World+ training set.
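The abstract names a rank-based training loss and three evaluation measures (pairwise accuracy, rank correlation, calibration). Below is a minimal sketch of one plausible instantiation: a Bradley-Terry-style pairwise logistic loss over frame pairs from the same episode, together with the first two metrics. The paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr

def pairwise_ranking_loss(r_earlier: torch.Tensor, r_later: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style logistic loss (an assumed instantiation of a
    rank-based loss): of two frames from the same episode, the one
    closer to task completion should receive the higher reward."""
    return F.softplus(r_earlier - r_later).mean()

def pairwise_accuracy(pred: torch.Tensor, progress: torch.Tensor) -> float:
    """Fraction of frame pairs whose predicted reward ordering matches
    the ground-truth task-progress ordering."""
    dp = pred.unsqueeze(0) - pred.unsqueeze(1)          # all predicted differences
    dt = progress.unsqueeze(0) - progress.unsqueeze(1)  # all true differences
    mask = dt != 0                                      # only pairs with a defined order
    return ((dp * dt) > 0)[mask].float().mean().item()

def rank_correlation(pred: torch.Tensor, progress: torch.Tensor) -> float:
    """Spearman rank correlation between predicted rewards and progress."""
    return spearmanr(pred.numpy(), progress.numpy()).correlation
```

Because a ranking loss supervises only orderings, the absolute reward scale is left unconstrained, which is presumably why calibration is evaluated separately.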
Problem

Research questions and friction points this paper is trying to address.

dense reward
robot manipulation
reward prediction
vision-based reward
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

dense reward prediction
vision foundation models
language-conditioned reward modeling
reward generalization
Meta-World+
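Both the summary and the abstract note that the learned model can replace an analytical reward when training off-the-shelf RL agents. A minimal sketch of that integration as a Gymnasium-style wrapper follows; the wrapper class and its helper arguments are hypothetical names, not from the paper.

```python
import gymnasium as gym
import torch

class LearnedRewardWrapper(gym.Wrapper):
    """Hypothetical wrapper that substitutes the environment's analytical
    reward with the prediction of a learned, language-conditioned model."""

    def __init__(self, env, reward_model, encode_image, text_feat):
        super().__init__(env)
        self.reward_model = reward_model  # e.g. LanguageConditionedReward above
        self.encode_image = encode_image  # frozen DINO feature extractor, image -> (1, D)
        self.text_feat = text_feat        # precomputed instruction embedding, (1, D')

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        frame = self.env.render()  # assumes render_mode="rgb_array" at env creation
        with torch.no_grad():
            img_feat = self.encode_image(frame)
            reward = self.reward_model(img_feat, self.text_feat).item()
        return obs, reward, terminated, truncated, info
```

Any standard algorithm, e.g. SAC or PPO from a library such as Stable-Baselines3, could then be trained on the wrapped environment without modification.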
Pierre Krack
Department of Computer Science & Artificial Intelligence, University of Technology Nuremberg, Germany
Tobias Jülg
Department of Computer Science & Artificial Intelligence, University of Technology Nuremberg, Germany
Wolfram Burgard
Professor of Computer Science, University of Technology Nuremberg
Robotics · Artificial Intelligence · AI · Machine Learning · Computer Vision
Florian Walter
University of Technology Nuremberg, Machine Intelligence Lab
Machine Intelligence · Robotics · Machine Learning · AI · Cognitive Robotics