🤖 AI Summary
This work addresses the challenge of jointly modeling environmental dynamics and embodied action planning. We propose a unified vision-language-action (VLA) and world-model framework, featuring a bidirectional co-enhancement architecture that integrates, end to end, visual encoding, language instruction understanding, action sequence generation, and vision-based world-model prediction of future states. Crucially, the framework is trained without reliance on pretrained models. Our key contribution is the first joint optimization of a VLA model and a world model within a shared representational space, simultaneously improving visual perception, action generation, and future-state prediction. Experiments demonstrate state-of-the-art performance: a 97.4% task success rate on the LIBERO simulation benchmark, and a 50% improvement in overall success rate when integrated into the real-world LeRobot platform, validating strong generalization and practical efficacy.
📝 Abstract
We introduce RynnVLA-002, a unified Vision-Language-Action (VLA) and world model. The world model leverages action and visual inputs to predict future image states, learning the underlying physics of the environment to refine action generation. Conversely, the VLA model produces subsequent actions from image observations, enhancing visual understanding and supporting the world model's image generation. This unified framework enables joint learning of environmental dynamics and action planning. Our experiments show that RynnVLA-002 surpasses standalone VLA and world models, demonstrating their mutual enhancement. We evaluate RynnVLA-002 on both simulation and real-world robot tasks. Without pretraining, RynnVLA-002 achieves a 97.4% success rate on the LIBERO simulation benchmark, and in real-world LeRobot experiments its integrated world model boosts the overall success rate by 50%.
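
The abstract describes a VLA action head and a world-model (next-frame prediction) head trained jointly on a shared representation, so gradients from each objective shape the other. Below is a minimal PyTorch sketch of that idea; the module names, toy encoders, and dimensions are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

class UnifiedVLAWorldModel(nn.Module):
    """Sketch: shared backbone with a VLA action head and a world-model head."""

    def __init__(self, embed_dim=256, action_dim=7):
        super().__init__()
        # Shared representation space for vision, language, and action tokens (toy encoders).
        self.image_encoder = nn.Linear(3 * 32 * 32, embed_dim)
        self.text_encoder = nn.Embedding(1000, embed_dim)
        self.action_encoder = nn.Linear(action_dim, embed_dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # VLA branch: predicts the next action from the fused representation.
        self.action_head = nn.Linear(embed_dim, action_dim)
        # World-model branch: predicts the next image (flattened pixels here).
        self.frame_head = nn.Linear(embed_dim, 3 * 32 * 32)

    def forward(self, image, text_ids, action):
        img_tok = self.image_encoder(image.flatten(1)).unsqueeze(1)   # (B, 1, D)
        txt_tok = self.text_encoder(text_ids)                         # (B, T, D)
        act_tok = self.action_encoder(action).unsqueeze(1)            # (B, 1, D)
        fused = self.backbone(torch.cat([img_tok, txt_tok, act_tok], dim=1))
        pooled = fused.mean(dim=1)
        return self.action_head(pooled), self.frame_head(pooled)

# Joint optimization: both losses back-propagate into the shared backbone,
# so action supervision informs future-frame prediction and vice versa.
model = UnifiedVLAWorldModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
image = torch.randn(8, 3, 32, 32)
text_ids = torch.randint(0, 1000, (8, 12))
action = torch.randn(8, 7)
target_action = torch.randn(8, 7)
target_frame = torch.randn(8, 3 * 32 * 32)

pred_action, pred_frame = model(image, text_ids, action)
loss = nn.functional.mse_loss(pred_action, target_action) + \
       nn.functional.mse_loss(pred_frame, target_frame)
loss.backward()
opt.step()
```

In this sketch, the "mutual enhancement" described above amounts to the two heads sharing one backbone and one optimizer step; the actual RynnVLA-002 architecture and losses are specified in the paper itself.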