🤖 AI Summary
This work addresses the challenge of jointly modeling environmental dynamics and embodied action planning. We propose a unified vision-language-action (VLA) and world-model framework, featuring a bidirectional co-enhancement architecture that integrates, end to end, visual encoding, language instruction understanding, action sequence generation, and vision-based world-model prediction of future states. Crucially, the framework is trained without reliance on pretrained models. Our key contribution is the first joint optimization of a VLA model and a world model within a shared representational space, simultaneously improving visual perception, action generation, and future-state prediction. Experiments demonstrate state-of-the-art performance: a 97.4% task success rate on the LIBERO simulation benchmark, and a 50% improvement in overall success rate when integrated into the real-world LeRobot platform, validating strong generalization and practical efficacy.
📝 Abstract
We introduce RynnVLA-002, a unified Vision-Language-Action (VLA) and world model. The world model leverages action and visual inputs to predict future image states, learning the underlying physics of the environment to refine action generation. Conversely, the VLA model produces subsequent actions from image observations, enhancing visual understanding and supporting the world model's image generation. This unified framework enables joint learning of environmental dynamics and action planning. Our experiments show that RynnVLA-002 surpasses standalone VLA and world models, demonstrating their mutual enhancement. We evaluate RynnVLA-002 on both simulation and real-world robot tasks. Without pretraining, RynnVLA-002 achieves a 97.4% success rate on the LIBERO simulation benchmark, and in real-world LeRobot experiments its integrated world model boosts the overall success rate by 50%.
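
The abstract describes a VLA action head and a world-model (next-frame prediction) head trained jointly on a shared representation, so gradients from each objective shape the other. Below is a minimal PyTorch sketch of that idea; the module names, toy encoders, and dimensions are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

class UnifiedVLAWorldModel(nn.Module):
    """Sketch: shared backbone with a VLA action head and a world-model head."""

    def __init__(self, embed_dim=256, action_dim=7):
        super().__init__()
        # Shared representation space for vision, language, and action tokens (toy encoders).
        self.image_encoder = nn.Linear(3 * 32 * 32, embed_dim)
        self.text_encoder = nn.Embedding(1000, embed_dim)
        self.action_encoder = nn.Linear(action_dim, embed_dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # VLA branch: predicts the next action from the fused representation.
        self.action_head = nn.Linear(embed_dim, action_dim)
        # World-model branch: predicts the next image (flattened pixels here).
        self.frame_head = nn.Linear(embed_dim, 3 * 32 * 32)

    def forward(self, image, text_ids, action):
        img_tok = self.image_encoder(image.flatten(1)).unsqueeze(1)   # (B, 1, D)
        txt_tok = self.text_encoder(text_ids)                         # (B, T, D)
        act_tok = self.action_encoder(action).unsqueeze(1)            # (B, 1, D)
        fused = self.backbone(torch.cat([img_tok, txt_tok, act_tok], dim=1))
        pooled = fused.mean(dim=1)
        return self.action_head(pooled), self.frame_head(pooled)

# Joint optimization: both losses back-propagate into the shared backbone,
# so action supervision informs future-frame prediction and vice versa.
model = UnifiedVLAWorldModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
image = torch.randn(8, 3, 32, 32)
text_ids = torch.randint(0, 1000, (8, 12))
action = torch.randn(8, 7)
target_action = torch.randn(8, 7)
target_frame = torch.randn(8, 3 * 32 * 32)

pred_action, pred_frame = model(image, text_ids, action)
loss = nn.functional.mse_loss(pred_action, target_action) + \
       nn.functional.mse_loss(pred_frame, target_frame)
loss.backward()
opt.step()
```

In this sketch, the "mutual enhancement" described above amounts to the two heads sharing one backbone and one optimizer step; the actual RynnVLA-002 architecture and losses are specified in the paper itself.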