🤖 AI Summary
This work addresses the challenge of enhancing robotic manipulation capabilities using human demonstration data. We propose a two-stage vision-language-action (VLA) pretraining framework. In Stage I, we jointly train an image-to-video generator and a keypoint trajectory predictor on first-person human demonstration videos. In Stage II, we integrate ActionVAE to compress action sequences, enabling efficient joint modeling of visual, linguistic, and motor modalities. To our knowledge, this is the first VLA pretraining paradigm that unifies video generation and trajectory prediction, thereby substantially reducing action-space complexity and improving model initialization. After fine-tuning on multiple downstream robotic manipulation tasks, our method consistently outperforms state-of-the-art baselines across all benchmarks, demonstrating both effectiveness and strong generalization capability.
📝 Abstract
This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When fine-tuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
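To make the ActionVAE idea concrete, the sketch below shows how a variational autoencoder can compress a chunk of actions into a single compact latent and reconstruct it. This is a minimal illustrative sketch, not the paper's implementation: the chunk length, action and latent dimensions, and the single-linear-layer encoder/decoder are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the paper does not specify these values.
CHUNK_LEN, ACTION_DIM, LATENT_DIM = 16, 7, 32
IN_DIM = CHUNK_LEN * ACTION_DIM  # flattened action chunk

# Randomly initialized single-layer weights (illustrative only; a real
# ActionVAE would use trained multi-layer networks).
W_mu = rng.normal(0, 0.01, (IN_DIM, LATENT_DIM))
W_logvar = rng.normal(0, 0.01, (IN_DIM, LATENT_DIM))
W_dec = rng.normal(0, 0.01, (LATENT_DIM, IN_DIM))

def encode(actions):
    """Map a (CHUNK_LEN, ACTION_DIM) action chunk to latent mean/log-variance."""
    x = actions.reshape(-1)
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps (the standard VAE reparameterization trick)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Reconstruct the full action chunk from the compact latent embedding."""
    return (z @ W_dec).reshape(CHUNK_LEN, ACTION_DIM)

chunk = rng.normal(size=(CHUNK_LEN, ACTION_DIM))   # one sequence of actions
mu, logvar = encode(chunk)                          # 112 values -> 32-dim latent
recon = decode(reparameterize(mu, logvar))          # latent -> action chunk
```

The point of the compression is visible in the shapes: the VLA model only has to predict a single `LATENT_DIM`-sized embedding instead of `CHUNK_LEN * ACTION_DIM` raw action values, which is what "reducing the complexity of the VLA output space" refers to.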