🤖 AI Summary
Existing large vision-language-action (VLA) models struggle to model spatial-temporal dynamics in interactive robotics, which limits their effectiveness on complex manipulation tasks such as grasping. To address this, the authors propose visual trace prompting, a simple yet effective paradigm that overlays compact visual cues encoding recent state-action trajectories onto the model's visual input, explicitly strengthening its spatial-temporal reasoning. The resulting TraceVLA model is obtained by finetuning OpenVLA on a self-collected dataset of 150K robot manipulation trajectories. Across 137 task configurations in SimplerEnv, TraceVLA outperforms OpenVLA by 10% in success rate; on a physical WidowX robot, it improves task success by 3.5x, while generalizing robustly across diverse embodiments and scenarios. A compact variant built on the 4B Phi-3-Vision model, pretrained on Open-X-Embodiment and finetuned on the same dataset, rivals the 7B OpenVLA baseline while offering significantly better inference efficiency, making it attractive for resource-constrained deployment.
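The summary above describes trajectory-to-image encoding only at a high level, and the paper's actual implementation is not shown here. As a rough sketch of the idea, the recent end-effector trajectory (projected to pixel coordinates) can be rasterized directly onto the current camera frame before it is passed to the VLA model. The function name, interpolation scheme, and marker style below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def overlay_visual_trace(image, trace_xy, color=(255, 0, 0)):
    """Overlay a 2D trajectory onto an RGB frame (H, W, 3 uint8).

    trace_xy: sequence of (x, y) pixel coordinates, ordered from
    oldest to newest state. Each segment is rasterized by dense
    linear interpolation between consecutive points.
    """
    img = image.copy()
    h, w = img.shape[:2]
    for (x0, y0), (x1, y1) in zip(trace_xy[:-1], trace_xy[1:]):
        n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        xs = np.linspace(x0, x1, n).round().astype(int)
        ys = np.linspace(y0, y1, n).round().astype(int)
        # Clip to the frame so off-screen trace points are dropped.
        keep = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
        img[ys[keep], xs[keep]] = color
    return img

# Example: a short diagonal trace drawn on a blank 224x224 frame.
frame = np.zeros((224, 224, 3), dtype=np.uint8)
trace = [(20, 200), (60, 160), (100, 120), (140, 80)]
prompted = overlay_visual_trace(frame, trace)
```

The prompted frame, rather than the raw frame, would then be fed to the policy, giving the model an explicit visual record of where the end effector has been.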
📝 Abstract
Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective at handling complex tasks such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to facilitate VLA models' spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks, and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision, pretrained on Open-X-Embodiment and finetuned on our dataset, which rivals the 7B OpenVLA baseline while significantly improving inference efficiency.