Training-free Generation of Temporally Consistent Rewards from VLMs

📅 2025-07-07

📈 Citations: 0

✨ Influential: 0

career value

208K/year

🤖 AI Summary

Vision-language models (VLMs) struggle to deliver temporally consistent, high-precision, low-latency reward signals for robot manipulation—due to their lack of robotics-specific knowledge in pretraining and the high cost and deployment latency of fine-tuning. Method: We propose T²-VLM, a zero-shot framework that requires no fine-tuning. It constructs spatially aware subgoal representations and integrates VLM zero-shot querying with Bayesian latent-state tracking to dynamically estimate subgoal completion progress over time, yielding structured, temporally consistent sparse rewards. Results: T²-VLM achieves state-of-the-art performance on two robot manipulation benchmarks: reward accuracy improves significantly, inference overhead drops by 62%, and real-time closed-loop control at 50 Hz is enabled. Moreover, it enhances policy robustness and fault tolerance in long-horizon tasks.

Technology Category

Application Category

📝 Abstract

Recent advances in vision-language models (VLMs) have significantly improved performance in embodied tasks such as goal decomposition and visual comprehension. However, providing accurate rewards for robotic manipulation without fine-tuning VLMs remains challenging due to the absence of domain-specific robotic knowledge in pre-trained datasets and high computational costs that hinder real-time applicability. To address this, we propose $mathrm{T}^2$-VLM, a novel training-free, temporally consistent framework that generates accurate rewards through tracking the status changes in VLM-derived subgoals. Specifically, our method first queries the VLM to establish spatially aware subgoals and an initial completion estimate before each round of interaction. We then employ a Bayesian tracking algorithm to update the goal completion status dynamically, using subgoal hidden states to generate structured rewards for reinforcement learning (RL) agents. This approach enhances long-horizon decision-making and improves failure recovery capabilities with RL. Extensive experiments indicate that $mathrm{T}^2$-VLM achieves state-of-the-art performance in two robot manipulation benchmarks, demonstrating superior reward accuracy with reduced computation consumption. We believe our approach not only advances reward generation techniques but also contributes to the broader field of embodied AI. Project website: https://t2-vlm.github.io/.

Problem

Research questions and friction points this paper is trying to address.

Generates accurate rewards for robotic manipulation without VLM fine-tuning

Addresses lack of domain-specific robotic knowledge in pre-trained VLMs

Reduces computational costs for real-time reward generation in RL

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free VLM framework for rewards

Bayesian tracking for dynamic goal updates

Spatially aware subgoals enhance RL performance

🔎 Similar Papers

No similar papers found.