🤖 AI Summary
This study investigates whether vision-language-action (VLA) policies trained solely by imitation learning implicitly encode information predictive of task success, even though their training objective contains no explicit reward prediction. The authors train lightweight linear probes on frozen VLA features (such as those from OpenVLA and Pi0.5), supervised by Monte Carlo outcome targets computed from mixed successful and failed manipulation trajectories on LIBERO-Goal. To rule out task and temporal shortcuts, predictions are compared only within the same task and at the same timestep. The results indicate that frozen VLAs contain linearly decodable success-predictive signals that the imitation objective never explicitly demands. Under the matched comparison, the Pi0.5 probe reaches roughly 92% pairwise ordering accuracy, while label-shuffled controls stay at chance. Used as a test-time selector over sampled Pi0.5 action prefixes, the same probe raises success on the push-plate task from 26.7% under greedy decoding to 44.3%, with a second positive result on the wine-rack task. The gains are not universal and require additional inference compute.
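To make the supervision signal concrete, the following is a minimal sketch of how per-timestep Monte Carlo outcome targets could be computed for one trajectory. The sparse terminal reward (1.0 on the last step of a successful episode, 0.0 otherwise), the discount `gamma`, and the function name `monte_carlo_targets` are illustrative assumptions, not details taken from the paper.

```python
from __future__ import annotations


def monte_carlo_targets(num_steps: int, success: bool, gamma: float = 1.0) -> list[float]:
    """Return one Monte Carlo outcome target per timestep of a trajectory.

    Assumption (not from the paper): the only reward is 1.0 on the final
    step of a successful episode; every other reward is 0.0.
    """
    terminal_reward = 1.0 if success else 0.0
    targets = [0.0] * num_steps
    running = 0.0
    # Walk backwards so each step's target is its reward plus the
    # discounted target of the following step.
    for t in reversed(range(num_steps)):
        reward = terminal_reward if t == num_steps - 1 else 0.0
        running = reward + gamma * running
        targets[t] = running
    return targets


# With gamma = 1.0 every step of a successful episode gets target 1.0,
# and every step of a failed episode gets target 0.0.
success_targets = monte_carlo_targets(3, success=True)
failure_targets = monte_carlo_targets(3, success=False)
discounted_targets = monte_carlo_targets(3, success=True, gamma=0.5)
```

A linear probe would then regress these targets from the frozen features extracted at the corresponding timesteps.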
📝 Abstract
Vision-language-action (VLA) policies are trained to imitate actions; their loss never asks them to estimate reward, progress, or future success. Their frozen representations nevertheless carry such information, and it can be read out and used to guide action choice without retraining the policy. From mixed successful and failed manipulation trajectories on LIBERO-Goal, we recover Monte-Carlo outcome targets using lightweight linear probes on frozen features. The targets are consistently predictable from OpenVLA, Pi0.5, DINOv2, and CLIP features, and substantially less so from baselines built on progress, time-to-go, task identity, or proprioception. To rule out task and temporal shortcuts, we evaluate the probes under same-task, same-timestep matched comparisons: Pi0.5 probes still reach roughly 92% pairwise ordering accuracy, while label-shuffled controls stay at chance. Used as a test-time selector over sampled Pi0.5 action prefixes, the same probe turns this offline finding into behavior: on push-plate, success rises from 26.7% under greedy decoding to 44.3%, with a second positive case on wine-rack. The gains are not universal and require additional inference compute, but the underlying finding is clean: frozen VLAs already encode information about success that their imitation objective never explicitly demands.
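The same-task, same-timestep matched comparison can be illustrated with a short sketch. It assumes each probe prediction is stored as a `(task, timestep, score, success)` tuple and counts how often a state from a successful trajectory is scored above a state from a failed trajectory within the same task and timestep. The record layout, the tie handling (ties count as half), and the name `matched_pairwise_accuracy` are hypothetical choices for illustration, not the authors' evaluation code.

```python
from collections import defaultdict
from itertools import product


def matched_pairwise_accuracy(records):
    """Pairwise ordering accuracy restricted to matched (task, timestep) groups.

    records: iterable of (task, timestep, score, success) tuples, where
    score is the probe output and success is the episode outcome.
    """
    # Split scores by outcome inside each (task, timestep) group.
    groups = defaultdict(lambda: ([], []))
    for task, timestep, score, success in records:
        groups[(task, timestep)][0 if success else 1].append(score)

    correct = 0.0
    total = 0
    for success_scores, failure_scores in groups.values():
        # Only success/failure pairs from the same group are compared, so
        # task identity and elapsed time cannot explain a correct ordering.
        for s_pos, s_neg in product(success_scores, failure_scores):
            total += 1
            if s_pos > s_neg:
                correct += 1.0
            elif s_pos == s_neg:
                correct += 0.5  # assumption: ties count as half
    return correct / total if total else float("nan")


records = [
    ("push-plate", 10, 0.90, True),
    ("push-plate", 10, 0.20, False),
    ("push-plate", 10, 0.95, False),
    ("wine-rack", 10, 0.10, True),  # no matched failure, so it forms no pair
]
accuracy = matched_pairwise_accuracy(records)
```

Here only the two push-plate pairs are counted, and one of them is ordered correctly. The low-scoring wine-rack success is never compared against push-plate failures, which is the point of the matching: a probe that merely recognizes the task or the timestep gains nothing.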