🤖 AI Summary
This work addresses the challenge of enhancing robotic manipulation capabilities using human demonstration data. We propose a two-stage vision-language-action (VLA) pretraining framework. In Stage I, we jointly train an image-to-video generator and a keypoint trajectory predictor on first-person human demonstration videos. In Stage II, we integrate ActionVAE to compress action sequences, enabling efficient joint modeling of visual, linguistic, and motor modalities. To our knowledge, this is the first VLA pretraining paradigm that unifies video generation and trajectory prediction, thereby substantially reducing action-space complexity and improving model initialization. After fine-tuning on multiple downstream robotic manipulation tasks, our method consistently outperforms state-of-the-art baselines across all benchmarks, demonstrating both effectiveness and strong generalization capability.
📝 Abstract
This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When fine-tuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
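To make the ActionVAE idea concrete, the sketch below shows how a variational autoencoder can compress a chunk of actions into a single compact latent and reconstruct it. This is a minimal illustrative sketch, not the paper's implementation: the chunk length, action and latent dimensions, and the single-linear-layer encoder/decoder are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the paper does not specify these values.
CHUNK_LEN, ACTION_DIM, LATENT_DIM = 16, 7, 32
IN_DIM = CHUNK_LEN * ACTION_DIM  # flattened action chunk

# Randomly initialized single-layer weights (illustrative only; a real
# ActionVAE would use trained multi-layer networks).
W_mu = rng.normal(0, 0.01, (IN_DIM, LATENT_DIM))
W_logvar = rng.normal(0, 0.01, (IN_DIM, LATENT_DIM))
W_dec = rng.normal(0, 0.01, (LATENT_DIM, IN_DIM))

def encode(actions):
    """Map a (CHUNK_LEN, ACTION_DIM) action chunk to latent mean/log-variance."""
    x = actions.reshape(-1)
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps (the standard VAE reparameterization trick)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Reconstruct the full action chunk from the compact latent embedding."""
    return (z @ W_dec).reshape(CHUNK_LEN, ACTION_DIM)

chunk = rng.normal(size=(CHUNK_LEN, ACTION_DIM))   # one sequence of actions
mu, logvar = encode(chunk)                          # 112 values -> 32-dim latent
recon = decode(reparameterize(mu, logvar))          # latent -> action chunk
```

The point of the compression is visible in the shapes: the VLA model only has to predict a single `LATENT_DIM`-sized embedding instead of `CHUNK_LEN * ACTION_DIM` raw action values, which is what "reducing the complexity of the VLA output space" refers to.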