🤖 AI Summary
This work addresses the challenge that pretrained robotic policies, despite possessing the requisite motor skills, often fail to generalize under test-time spatial or semantic distribution shifts, such as changes in obstacles, support surfaces, or mild clutter. To overcome this limitation, the authors propose VLS, a training-free, inference-time adaptation framework that uses a vision-language model to construct a differentiable reward function. This reward guides the sampling process of frozen diffusion or flow-matching policies without altering their parameters, aligning generated actions with test-time task and environmental constraints. VLS is the first method to enable training-free, zero-shot control of generative robotic policies by reframing adaptation as vision-language-informed sampling. The approach achieves significant performance gains of 31% on CALVIN and 13% on LIBERO-PRO, and demonstrates robust adaptation to real-world distribution shifts on a physical Franka robot.
📝 Abstract
Why do pretrained diffusion or flow-matching policies fail when the same task is performed near an obstacle, on a shifted support surface, or amid mild clutter? Such failures rarely reflect missing motor skills; instead, they expose a limitation of imitation learning under train-test shifts, where action generation is tightly coupled to training-specific spatial configurations and task specifications. Retraining or fine-tuning to address these failures is costly and conceptually misaligned, as the required behaviors already exist but cannot be selectively adapted at test time. We propose Vision-Language Steering (VLS), a training-free framework for inference-time adaptation of frozen generative robot policies. VLS treats adaptation as an inference-time control problem, steering the sampling process of a pretrained diffusion or flow-matching policy in response to out-of-distribution observation-language inputs without modifying policy parameters. By leveraging vision-language models to synthesize trajectory-differentiable reward functions, VLS guides denoising toward action trajectories that satisfy test-time spatial and task requirements. Across simulation and real-world evaluations, VLS consistently outperforms prior steering methods, achieving a 31% improvement on CALVIN and a 13% gain on LIBERO-PRO. Real-world deployment on a Franka robot further demonstrates robust inference-time adaptation under test-time spatial and semantic shifts. Project page: https://vision-language-steering.github.io/webpage/
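The core mechanism described above, steering the denoising process of a frozen policy with the gradient of a differentiable reward, can be illustrated with a minimal toy sketch. Everything here is an assumption for illustration: `frozen_denoise_step` is a stand-in for a pretrained policy's denoiser, and `reward_grad` is a hand-written obstacle-avoidance penalty standing in for a VLM-synthesized reward; the actual VLS reward functions, policy architecture, and sampler are not specified by the abstract.

```python
import numpy as np

# Hypothetical stand-in for a frozen policy's denoiser: each step nudges
# the noisy trajectory toward a fixed "in-distribution" straight-line plan.
def frozen_denoise_step(traj, target, alpha=0.5):
    return traj + alpha * (target - traj)

# Toy differentiable reward: a penalty -0.5 * (radius - dist)^2 for
# waypoints inside an obstacle's radius. We return its analytic gradient,
# which pushes waypoints radially away from the obstacle.
def reward_grad(traj, obstacle, radius=0.4):
    diff = traj - obstacle                                   # (T, 2)
    dist = np.linalg.norm(diff, axis=1, keepdims=True)       # (T, 1)
    inside = dist < radius
    grad = np.where(inside, (radius - dist) * diff / np.maximum(dist, 1e-8), 0.0)
    return grad

def steered_sampling(target, obstacle, steps=50, guide_scale=1.0, seed=0):
    rng = np.random.default_rng(seed)
    traj = rng.normal(size=target.shape)   # start from pure noise
    for _ in range(steps):
        traj = frozen_denoise_step(traj, target)                  # frozen policy
        traj = traj + guide_scale * reward_grad(traj, obstacle)   # reward steering
    return traj

# 8 waypoints on a straight line through an obstacle at (0.5, 0.0).
target = np.stack([np.linspace(0.0, 1.0, 8), np.zeros(8)], axis=1)
obstacle = np.array([0.5, 0.0])

plain = steered_sampling(target, obstacle, guide_scale=0.0)
steered = steered_sampling(target, obstacle, guide_scale=1.0)

min_plain = np.linalg.norm(plain - obstacle, axis=1).min()
min_steer = np.linalg.norm(steered - obstacle, axis=1).min()
print(min_plain, min_steer)  # steered waypoints keep more clearance
```

The frozen denoiser is never modified; only its sampling trajectory is biased by the reward gradient at each step, which is the training-free, inference-time adaptation idea the abstract describes.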