X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

📅 2025-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poor cross-platform and cross-dataset generalization of generic vision-language-action (VLA) models. To this end, we propose a scalable architecture based on soft prompts: for each category of heterogeneous robotic data—differing in platform, environment, or task—we introduce an independent, learnable soft prompt embedding that conditions a standard Transformer encoder; we replace task-specific decoders with a flow-matching mechanism for end-to-end action modeling. Crucially, our method integrates multi-source embodied data without modifying the backbone network, significantly enhancing adaptability to diverse robot morphologies and tasks. Experiments across six simulation environments and three real-world robotic platforms demonstrate that our 0.9B-parameter model outperforms prior approaches on multiple benchmarks, exhibiting strong generalization and rapid zero-shot or few-shot adaptation capabilities.
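
To make the soft-prompt conditioning concrete, here is a minimal PyTorch sketch of the idea as described above: one learnable embedding sequence per heterogeneous data source, prepended to the token stream of a standard Transformer encoder, leaving the backbone unchanged. The class name, dimensions, and prompt length are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of embodiment-specific soft prompts (names, dimensions,
# and prompt length are illustrative assumptions, not the official X-VLA code).
import torch
import torch.nn as nn

class SoftPromptedEncoder(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_layers=12,
                 num_sources=20, prompt_len=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One learnable prompt per heterogeneous data source; these few
        # parameters are the only embodiment-specific additions.
        self.prompts = nn.Parameter(
            torch.randn(num_sources, prompt_len, d_model) * 0.02)

    def forward(self, tokens, source_id):
        # tokens: (B, T, d_model) fused vision-language(-state) tokens
        # source_id: (B,) integer index of each sample's data source
        prompt = self.prompts[source_id]        # (B, prompt_len, d_model)
        x = torch.cat([prompt, tokens], dim=1)  # prepend the soft prompt
        return self.encoder(x)                  # backbone itself is unchanged
```

Under this reading, adapting to a new embodiment amounts to learning a fresh prompt (optionally with light fine-tuning), which is consistent with the rapid few-shot adaptation the summary reports.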

📝 Abstract
Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity in rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimally added parameters: we infuse prompt-learning concepts into cross-embodiment robot learning and introduce a separate set of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which together empower VLA models to effectively exploit varying cross-embodiment features. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders, enjoying both scalability and simplicity. Evaluated across 6 simulations as well as 3 real-world robots, our 0.9B instantiation, X-VLA-0.9B, simultaneously achieves SOTA performance over a sweep of benchmarks, demonstrating superior results across a wide range of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks. Website: https://thu-air-dream.github.io/X-VLA/
Problem

Research questions and friction points this paper is trying to address.

Developing scalable cross-embodiment vision-language-action models
Leveraging heterogeneous robotic datasets with minimal parameters
Achieving superior adaptation across embodiments, environments, and tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Soft Prompt approach with minimal added parameters
Separate learnable embeddings for each data source
Flow-matching-based VLA architecture using standard Transformer encoders (see the sketch below)
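
As a rough illustration of the flow-matching action head, the sketch below trains a velocity network to transport Gaussian noise into an action chunk, then integrates it with a simple Euler sampler. The linear interpolation path, shapes, horizon, and step count are assumptions for illustration; the paper's exact formulation may differ.

```python
# Hedged flow-matching sketch (linear/rectified-flow path; shapes, horizon,
# and the Euler sampler are illustrative assumptions, not the paper's spec).
import torch

def flow_matching_loss(velocity_net, actions, cond):
    """actions: (B, H, A) expert action chunk; cond: encoder features."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.size(0), 1, 1, device=actions.device)
    x_t = (1 - t) * noise + t * actions  # point on the noise-to-action path
    target_v = actions - noise           # constant velocity along that path
    pred_v = velocity_net(x_t, t.flatten(), cond)
    return ((pred_v - target_v) ** 2).mean()

@torch.no_grad()
def sample_actions(velocity_net, cond, horizon=16, action_dim=7, steps=10):
    """Integrate the learned velocity field from noise to an action chunk."""
    x = torch.randn(cond.size(0), horizon, action_dim, device=cond.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.size(0),), i * dt, device=x.device)
        x = x + dt * velocity_net(x, t, cond)  # forward Euler step
    return x
```

This replaces task-specific decoders with a single generative head: the same velocity network serves every embodiment, conditioned only on the soft-prompted encoder features.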
👥 Authors

Jinliang Zheng · Tsinghua University · Computer Vision, Embodied AI
Jianxiong Li · Institute for AI Industry Research (AIR), Tsinghua University
Zhihao Wang · Peking University · Robotics, Reinforcement Learning
Dongxiu Liu · Beijing University of Posts and Telecommunications · Robot Manipulation, Task Planning, Computer Vision
Xirui Kang · Institute for AI Industry Research (AIR), Tsinghua University
Yuchun Feng · Institute for AI Industry Research (AIR), Tsinghua University
Yinan Zheng · Tsinghua University · Reinforcement Learning, Diffusion Models, Autonomous Driving, Robotics
Jiayin Zou · Institute for AI Industry Research (AIR), Tsinghua University
Yilun Chen · Shanghai AI Lab
Jia Zeng · Shanghai AI Lab
Ya-Qin Zhang · Institute for AI Industry Research (AIR), Tsinghua University
Jiangmiao Pang · Shanghai AI Lab
Jingjing Liu · Institute for AI Industry Research (AIR), Tsinghua University
Tai Wang · Shanghai AI Laboratory · Computer Vision, 3D Vision, Embodied AI, Deep Learning
Xianyuan Zhan · Associate Professor, Institute for AI Industry Research (AIR), Tsinghua University · Data-driven Decision-making, Real-world RL/IL, Embodied AI, Autonomous Driving