🤖 AI Summary
This work explores the application of state-of-the-art video generation models to generalizable robotic manipulation. It proposes Veo-Act, a hierarchical framework that, for the first time in the field, leverages the Veo-3 video generation model for high-level motion planning and integrates an inverse dynamics model (IDM) to enable zero-shot, task-level trajectory generation. A vision-language-action (VLA) policy serves as the low-level executor that follows the generated instructions. Notably, the approach requires no expert demonstrations and is effective on both simulated and real high-dimensional dexterous hands. Experimental results show that Veo-3 combined with the IDM produces plausible action sequences, and that Veo-Act substantially improves the instruction-following performance of VLA policies, suggesting that video generation models could establish a new paradigm in robot learning.
📝 Abstract
Video generation models have advanced rapidly and are beginning to show a strong understanding of physical dynamics. In this paper, we investigate how far an advanced video generation model such as Veo-3 can support generalizable robotic manipulation. We first study a zero-shot approach in which Veo-3 predicts future image sequences from current robot observations, while an inverse dynamics model (IDM) recovers the corresponding robot actions. The IDM is trained solely on random-play data, requiring neither human supervision nor expert demonstrations. The key intuition is that, if a video model can generate physically plausible future motions in image space, an IDM can translate those visual trajectories into executable robot actions. We evaluate this "Veo-3+IDM" approach in both simulation and the real world using a high-dimensional dexterous hand. We find that, owing to the strong generalization capability of frontier video models, Veo-3+IDM can consistently generate approximately correct task-level trajectories. However, its low-level control accuracy remains insufficient to solve most tasks reliably. Motivated by this observation, we develop a hierarchical framework, Veo-Act, which uses Veo-3 as a high-level motion planner and a VLA policy as the low-level executor, significantly improving the instruction-following performance of a state-of-the-art vision-language-action policy. Overall, our results suggest that, as video generation models continue to improve, they can become a valuable component for generalizable robot learning.
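The "Veo-3+IDM" pipeline described above can be sketched in a few lines: generate a visual trajectory from the current observation, then decode each frame-to-frame transition into an action. The sketch below is purely illustrative; the function names, the stand-in data types, and the stub models are our own assumptions, not the paper's interfaces (a real system would call the Veo-3 video model and a learned IDM network).

```python
# Minimal sketch of a video-model-plus-IDM control loop, assuming X:
# (X = the hypothetical interfaces below; Veo-3 and the IDM are replaced
# by trivial numeric stubs so the structure of the loop is visible.)

from typing import List

Frame = List[float]   # stand-in for an image observation
Action = List[float]  # stand-in for a robot action vector


def veo3_generate(obs: Frame, instruction: str, horizon: int) -> List[Frame]:
    """Stub for the video model: predict `horizon` future frames.

    Here it just ramps each value by 0.1 per step; a real system would
    condition a video generation model on the observation and instruction.
    """
    return [[x + (t + 1) * 0.1 for x in obs] for t in range(horizon)]


def idm(frame_a: Frame, frame_b: Frame) -> Action:
    """Stub inverse dynamics model: infer the action linking two frames.

    A real IDM would be a network trained on random-play data; this stub
    returns the per-dimension difference.
    """
    return [b - a for a, b in zip(frame_a, frame_b)]


def plan_actions(obs: Frame, instruction: str, horizon: int = 4) -> List[Action]:
    """High level: generate a visual trajectory, then decode it into actions."""
    frames = [obs] + veo3_generate(obs, instruction, horizon)
    return [idm(frames[t], frames[t + 1]) for t in range(horizon)]


if __name__ == "__main__":
    # One decoded action per predicted frame transition.
    actions = plan_actions([0.0, 1.0], "pick up the cube")
    print(len(actions))
```

In the full Veo-Act framework these decoded trajectories would serve only as high-level plans, with a VLA policy handling the low-level execution.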