🤖 AI Summary
Robots generalize poorly in open-world environments, which hinders real-world deployment and progress toward artificial general intelligence. Existing vision-language-action (VLA) models use large language and vision models to improve instruction understanding, but they still struggle with cross-task, cross-object, and cross-scene generalization in robotic manipulation. This work, VideoVLA, explores repurposing large-scale video generation models as general-purpose VLA systems. It proposes a dual-prediction framework that jointly forecasts an action sequence and the corresponding future visual frames, so that skills can transfer through “visual imagination.” Built on a multimodal Diffusion Transformer, the model unifies video, language, and action modalities within a single generative framework. Experiments show that high-fidelity visual prediction correlates with reliable action predictions and task success. Notably, the approach achieves strong zero-shot generalization on real robotic platforms, offering a novel embodied learning paradigm for intelligent agents.
📝 Abstract
Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments' skills and handling novel objects. This dual-prediction strategy - forecasting both actions and their visual consequences - explores a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.
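The dual-prediction idea in the abstract can be illustrated with a toy sketch. This is not the VideoVLA implementation: the names `DualPrediction`, `toy_denoiser`, and `predict` are hypothetical, the "denoiser" is a hand-written stand-in for the learned Diffusion Transformer, and placing the future-frame and action tokens in one shared sequence with a single token width is an assumption made only for illustration. The point it shows is the interface: an instruction and an image go in as conditioning, and one iterative denoising pass refines the future frames and the actions together.

```python
import random
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class DualPrediction:
    """Hypothetical container for the two outputs named in the abstract."""
    actions: List[Vector]        # predicted action sequence
    future_frames: List[Vector]  # predicted future visual latents


def toy_denoiser(noisy: List[Vector], context: List[Vector]) -> List[Vector]:
    """Stand-in for the learned network (assumption, not the real model).

    Returns a 'clean' estimate for every noisy token; here that estimate
    is simply the mean of the conditioning tokens.
    """
    dim = len(context[0])
    mean = [sum(tok[d] for tok in context) / len(context) for d in range(dim)]
    return [list(mean) for _ in noisy]


def predict(instruction_tokens: List[Vector],
            image_tokens: List[Vector],
            num_frames: int,
            num_actions: int,
            steps: int = 10,
            seed: int = 0) -> DualPrediction:
    rng = random.Random(seed)
    # Conditioning: language and current-image tokens, never noised.
    context = instruction_tokens + image_tokens
    dim = len(context[0])
    # Targets: future-frame and action tokens start as Gaussian noise
    # and share one sequence, so they are refined jointly.
    noisy = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(num_frames + num_actions)]
    for step in range(steps):
        estimate = toy_denoiser(noisy, context)
        # Step size grows to 1.0, so the final step lands on the estimate.
        alpha = 1.0 / (steps - step)
        noisy = [[x + alpha * (e - x) for x, e in zip(tok, est)]
                 for tok, est in zip(noisy, estimate)]
    # Split the shared sequence back into its two modalities.
    return DualPrediction(actions=noisy[num_frames:],
                          future_frames=noisy[:num_frames])
```

Calling `predict(instr, img, num_frames=3, num_actions=4)` with small lists of equal-width vectors returns three frame tokens and four action tokens produced by the same denoising loop; in a real system the stand-in denoiser would be replaced by the pre-trained video generation backbone.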