TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

📅 2026-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of low sample efficiency and sparse rewards in real-world robotic reinforcement learning, compounded by the limited generalization of existing process reward models. The authors propose a novel temporal value function derived from the internal token probabilities of pretrained video vision-language models (e.g., Qwen3-VL), which enables zero-shot extraction of task progress signals directly from token logits—bypassing the distortion inherent in conventional numerical prompting. Evaluated on over 130 real-world robotic tasks, the method achieves an average ordinal correlation of 0.947, substantially outperforming the GVL baseline. Furthermore, it demonstrates strong generalization and practical utility by successfully enabling success detection and reward-aligned behavioral cloning without task-specific fine-tuning.

📝 Abstract
While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
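The core idea — reading task progress from the model's token probabilities rather than from a generated number — can be sketched as a probability-weighted expectation over candidate progress tokens. The token vocabulary, the dictionary-of-logits interface, and the normalization below are illustrative assumptions for a minimal example, not the paper's exact formulation or the Qwen3-VL API.

```python
import math

def expected_progress(logits: dict[str, float]) -> float:
    """Estimate task progress in [0, 1] from raw logits.

    `logits` maps numeric token strings (e.g. "0".."100") to the VLM's
    raw logit at the answer position (hypothetical interface). We apply
    a softmax over the candidates, then take the probability-weighted
    mean instead of trusting a single sampled number.
    """
    m = max(logits.values())                       # for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return sum(int(tok) * e / z for tok, e in exps.items()) / 100.0

# Toy example: probability mass concentrated near the "70" token.
toy_logits = {"0": -2.0, "30": 0.0, "70": 3.0, "100": -1.0}
progress = expected_progress(toy_logits)
```

Compared with prompting the model to print a progress value, this reading is smooth in the logits, which is what makes it usable as a dense reward signal.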
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Sparse Rewards
Sample Efficiency
Generalizable Reward Models
Vision-Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Probabilities
Zero-Shot Reward
Vision-Language Models
Temporal Value Function
Robotic Task Progress