RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of improving robotic manipulation using human demonstration data. We propose a two-stage vision-language-action (VLA) pretraining framework. Stage I performs ego-centric video generative pretraining: an image-to-video model is trained on first-person human demonstration videos to predict future frames from an initial frame and a language instruction. Stage II adds human-centric trajectory-aware modeling, jointly predicting future frames and keypoint trajectories to bridge visual prediction and action prediction. An ActionVAE further compresses action sequences into compact latent embeddings, reducing action-space complexity and enabling efficient joint modeling of visual, linguistic, and motor modalities. To our knowledge, this is the first VLA pretraining paradigm to unify video generation with trajectory prediction. When fine-tuned on downstream robotic manipulation tasks, the model consistently outperforms state-of-the-art baselines, demonstrating both effectiveness and strong generalization.
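To make the joint-prediction idea concrete, below is a minimal, illustrative sketch of a shared backbone with two output heads, one for future-frame latents and one for a short keypoint trajectory, assuming PyTorch. The class name FrameTrajectoryPredictor, the token layout, and all dimensions are placeholders introduced here; they are not the paper's actual architecture.

```python
# Illustrative two-head predictor (a sketch, not the paper's model): a shared
# transformer backbone consumes language and visual tokens and emits
# (a) latent tokens for future frames and (b) a short keypoint trajectory.
import torch
import torch.nn as nn

class FrameTrajectoryPredictor(nn.Module):
    def __init__(self, d_model=512, n_frame_tokens=64, traj_len=8, n_keypoints=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.frame_head = nn.Linear(d_model, d_model)                     # next-frame latent tokens
        self.traj_head = nn.Linear(d_model, traj_len * n_keypoints * 2)   # (x, y) per keypoint per step
        self.n_frame_tokens = n_frame_tokens

    def forward(self, vis_tokens, lang_tokens):
        # vis_tokens: (B, Nv, d_model), lang_tokens: (B, Nl, d_model)
        h = self.backbone(torch.cat([lang_tokens, vis_tokens], dim=1))
        frame_latents = self.frame_head(h[:, -self.n_frame_tokens:])  # read out last visual positions
        traj = self.traj_head(h[:, 0])                                 # pool via the first token
        return frame_latents, traj

# Example shapes: vis_tokens = torch.randn(2, 64, 512), lang_tokens = torch.randn(2, 16, 512)
```

Sharing one backbone across both heads mirrors the stated goal of bridging frame prediction with trajectory prediction inside a single model.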

📝 Abstract
This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When fine-tuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
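The ActionVAE described above compresses a chunk of low-level actions into one compact latent embedding, so the VLA model predicts a single latent instead of every raw action step. Below is a minimal sketch of such an action-chunk VAE, assuming PyTorch; the 7-DoF action dimension, 16-step chunk length, layer sizes, and KL weight beta are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal action-chunk VAE sketch in the spirit of ActionVAE (all sizes assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionVAE(nn.Module):
    def __init__(self, action_dim=7, chunk_len=16, latent_dim=32, hidden=256):
        super().__init__()
        flat = action_dim * chunk_len
        self.encoder = nn.Sequential(nn.Linear(flat, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent_dim)       # posterior mean
        self.to_logvar = nn.Linear(hidden, latent_dim)   # posterior log-variance
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, flat))

    def forward(self, actions):
        # actions: (B, chunk_len, action_dim)
        h = self.encoder(actions.flatten(1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        recon = self.decoder(z).view_as(actions)
        return recon, mu, logvar

def vae_loss(recon, actions, mu, logvar, beta=1e-3):
    # Reconstruction + KL divergence; beta is a hypothetical weight, not from the paper.
    rec = F.mse_loss(recon, actions)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl
```

At inference, a policy head would predict a latent and the frozen decoder would map it back to an executable action sequence, which is what reduces the VLA output space to a few latent dimensions per chunk.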
Problem

Research questions and friction points this paper is trying to address.

Improving robot manipulation through human demonstration videos
Bridging visual frame prediction with action trajectory modeling
Compressing action sequences to reduce VLA output complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage video generative pretraining methodology
ActionVAE compresses action sequences into embeddings
Jointly predicts future frames and keypoint trajectories
Authors
Yuming Jiang (DAMO Academy, Alibaba Group)
Siteng Huang (Alibaba DAMO Academy | ZJU | Westlake University): Vision-language Models, Generative Models, Embodied AI
Shengke Xue (DAMO Academy, Alibaba Group)
Yaxi Zhao (DAMO Academy, Alibaba Group)
Jun Cen (DAMO Academy, Alibaba Group)
Sicong Leng (Nanyang Technological University): Multi-modal Learning
Kehan Li (Stanford University)
Jiayan Guo (Alibaba DAMO Academy, Peking University): LLM, MLLM, Embodied AI, Agents, Recommender System
Kexiang Wang (DAMO Academy, Alibaba Group)
Mingxiu Chen (DAMO Academy, Alibaba Group)
Fan Wang (DAMO Academy, Alibaba Group)
Deli Zhao (Alibaba DAMO Academy): generative models, multimodal learning, foundation models
Xin Li (DAMO Academy, Alibaba Group)