Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work explores the application of state-of-the-art video generation models to generalizable robotic manipulation. It proposes Veo-Act, a hierarchical framework that leverages the Veo-3 video generation model for high-level motion planning and integrates an inverse dynamics model (IDM) to enable zero-shot, task-level trajectory generation. A vision-language-action (VLA) policy serves as the low-level executor that follows the generated plans. Notably, the approach requires no expert demonstrations and is demonstrated on high-dimensional dexterous hands in both simulation and the real world. Experimental results show that Veo-3 combined with an IDM produces plausible action sequences, and that Veo-Act substantially improves the instruction-following performance of VLA policies, revealing the potential of video generation models to establish a new paradigm in robot learning.
📝 Abstract
Video generation models have advanced rapidly and are beginning to show a strong understanding of physical dynamics. In this paper, we investigate how far an advanced video generation model such as Veo-3 can support generalizable robotic manipulation. We first study a zero-shot approach in which Veo-3 predicts future image sequences from current robot observations, while an inverse dynamics model (IDM) recovers the corresponding robot actions. The IDM is trained solely on random-play data, requiring neither human supervision nor expert demonstrations. The key intuition is that, if a video model can generate physically plausible future motions in image space, an IDM can translate those visual trajectories into executable robot actions. We evaluate this "Veo-3+IDM" approach in both simulation and the real world using a high-dimensional dexterous hand. We find that, owing to the strong generalization capability of frontier video models, Veo-3+IDM can consistently generate approximately correct task-level trajectories. However, its low-level control accuracy remains insufficient to solve most tasks reliably. Motivated by this observation, we develop a hierarchical framework, Veo-Act, which uses Veo-3 as a high-level motion planner and a VLA policy as the low-level executor, significantly improving the instruction-following performance of a state-of-the-art vision-language-action policy. Overall, our results suggest that, as video generation models continue to improve, they can become a valuable component of generalizable robot learning.
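The zero-shot "Veo-3+IDM" pipeline described in the abstract can be sketched as: predict future frames in image space from the current observation, then have an inverse dynamics model map each pair of consecutive frames to a robot action. The sketch below is a minimal, hypothetical illustration of that control flow only; `generate_future_frames` and `inverse_dynamics` are stand-ins (not the paper's actual Veo-3 interface or learned IDM), and frames are plain NumPy arrays rather than real camera images.

```python
import numpy as np

def generate_future_frames(obs, horizon, rng):
    # Hypothetical stand-in for the video model (Veo-3 in the paper):
    # a real system would condition on the observation and a language
    # instruction. Here we just perturb the current frame.
    return [obs + rng.normal(scale=0.01, size=obs.shape) for _ in range(horizon)]

def inverse_dynamics(frame_t, frame_t1):
    # Hypothetical IDM: maps a pair of consecutive frames to a scalar
    # "action". In the paper this is a model trained only on random-play
    # data, with no expert demonstrations.
    return (frame_t1 - frame_t).mean()

def veo_plus_idm(obs, horizon=8, seed=0):
    """Zero-shot pipeline: imagine future frames, then recover actions."""
    rng = np.random.default_rng(seed)
    frames = [obs] + generate_future_frames(obs, horizon, rng)
    # One action per predicted frame transition.
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]

actions = veo_plus_idm(np.zeros((64, 64)))
print(len(actions))  # prints 8: one action per transition
```

The hierarchical Veo-Act variant would instead feed the generated frames to a VLA policy as sub-goals rather than executing IDM actions directly.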
Problem

Research questions and friction points this paper is trying to address.

video generation models, robot manipulation, generalization, zero-shot learning, dexterous hand
Innovation

Methods, ideas, or system contributions that make the work stand out.

video generation models, inverse dynamics model, zero-shot robot manipulation, hierarchical control, vision-language-action policy
Authors
Zhongru Zhang (Tsinghua University)
Chenghan Yang (Tsinghua University)
Qingzhou Lu (Tsinghua University)
Yanjiang Guo (Tsinghua University; Embodied AI, Generative Model)
Jianke Zhang (Tsinghua University, IIIS; Embodied AI, VLM, Multimodal Learning)
Yucheng Hu (Tsinghua University)
Jianyu Chen (Assistant Professor, Tsinghua University; AI, Robotics)