XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language-action (VLA) models face two key challenges: (1) precisely mapping high-dimensional visual perception to low-level robot actions, and (2) a substantial domain gap between heterogeneous robotic embodiments and human demonstrations. To address these, the authors propose X Robotic Model 1 (XR-1), a VLA framework built on Unified Vision-Motion Codes (UVMC): a discrete latent representation learned by a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. XR-1 is trained in three stages—self-supervised UVMC learning, UVMC-guided pretraining on large-scale cross-embodiment robotic data, and task-specific post-training—to align and transfer knowledge from diverse sources. Evaluated on six real-world robot embodiments with more than 14,000 rollouts spanning over 120 manipulation tasks, XR-1 consistently outperforms state-of-the-art baselines including π₀.₅, π₀, RDT, UniVLA, and GR00T-N1.5, while generalizing to novel objects, background variations, distractors, and illumination changes.

📝 Abstract
Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, and (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multi-modal knowledge present in large-scale, heterogeneous datasets. In this work, we present X Robotic Model 1 (XR-1), a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. XR-1 introduces the Unified Vision-Motion Codes (UVMC), a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses these challenges by (i) serving as an intermediate representation between the observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To effectively exploit UVMC, we propose a three-stage training paradigm: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments with more than 14,000 rollouts on six different robot embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as π₀.₅, π₀, RDT, UniVLA, and GR00T-N1.5 while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes. Our project is at https://xr-1-vla.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Producing precise low-level actions from high-dimensional visual observations
Bridging domain gaps across heterogeneous robot embodiments and data sources
Exploiting complementary multimodal knowledge in large-scale robotic datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning unified vision-motion codes via dual-branch VQ-VAE
Using intermediate representation between observations and actions
Three-stage training with cross-embodiment robotic datasets
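The unified vision-motion codes described above rest on a standard vector-quantization step: continuous features from each branch are snapped to the nearest entry of a learned codebook, so visual dynamics and robot motion end up expressed in one discrete vocabulary. A minimal NumPy sketch of that quantization step follows; all dimensions, variable names, and the shared-codebook choice are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): feature dim D, codebook size K.
D, K = 8, 16
codebook = rng.normal(size=(K, D))  # one codebook shared by both branches

def quantize(z):
    """Map each row of z to its nearest codebook entry (the VQ step)."""
    # Squared Euclidean distance from every feature row to every code
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)          # discrete code indices
    return idx, codebook[idx]        # indices and quantized vectors

# Toy stand-ins for the two branch encoders' outputs
z_vision = rng.normal(size=(5, D))   # visual-dynamics features
z_motion = rng.normal(size=(5, D))   # robot-motion features

codes_v, zq_v = quantize(z_vision)
codes_m, zq_m = quantize(z_motion)
```

In a trained VQ-VAE the encoders, decoders, and codebook are learned jointly (with a straight-through estimator and commitment loss); this sketch only shows how two modalities can share one discrete code space.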
👥 Authors

Shichao Fan (Beijing Innovation Center of Humanoid Robotics)
Kun Wu (Beijing Innovation Center of Humanoid Robotics)
Zhengping Che (X-Humanoid; Embodied AI, Deep Learning)
Xinhua Wang (Beijing Innovation Center of Humanoid Robotics)
Di Wu (Beijing Innovation Center of Humanoid Robotics)
Fei Liao (Beijing Innovation Center of Humanoid Robotics)
Ning Liu (Beijing Innovation Center of Humanoid Robotics)
Yixue Zhang (Beijing Innovation Center of Humanoid Robotics)
Zhen Zhao (Beijing Innovation Center of Humanoid Robotics)
Zhiyuan Xu (Beijing Innovation Center of Humanoid Robotics)
Meng Li (Beijing Innovation Center of Humanoid Robotics)
Qingjie Liu (Professor, School of Computer Science and Engineering, Beihang University; Computer Vision and Pattern Recognition)
Shan-Shan Zhang (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Min Wan (School of Mechanical Engineering and Automation, Beihang University)
Jian Tang (Beijing Innovation Center of Humanoid Robotics)