🤖 AI Summary
This work addresses the poor cross-platform and cross-dataset generalization of generic vision-language-action (VLA) models. To this end, we propose a scalable architecture based on soft prompts: for each category of heterogeneous robotic data (differing in platform, environment, or task), we introduce an independent, learnable soft-prompt embedding that conditions a standard Transformer encoder, and we replace task-specific decoders with a flow-matching mechanism for end-to-end action modeling. Crucially, our method integrates multi-source embodied data without modifying the backbone network, significantly enhancing adaptability to diverse robot morphologies and tasks. Experiments across six simulation environments and three real-world robotic platforms demonstrate that our 0.9B-parameter model outperforms prior approaches on multiple benchmarks, exhibiting strong generalization and rapid zero-shot and few-shot adaptation.
📝 Abstract
Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity in rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimally added parameters: we infuse prompt-learning concepts into cross-embodiment robot learning by introducing a separate set of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which together empower VLA models to effectively exploit varying cross-embodiment features. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders, enjoying both scalability and simplicity. Evaluated across 6 simulations as well as 3 real-world robots, our 0.9B instantiation, X-VLA-0.9B, simultaneously achieves SOTA performance over a sweep of benchmarks, demonstrating superior results on a wide range of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks. Website: https://thu-air-dream.github.io/X-VLA/
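The soft-prompt idea described above can be sketched in a few lines of PyTorch: a separate learnable prompt per data source is prepended to the token sequence of a shared Transformer encoder, and a head predicts a flow-matching velocity over the action positions. All module names, sizes, and the head design here are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class SoftPromptedEncoder(nn.Module):
    """Minimal sketch of per-embodiment soft prompting (hypothetical names/sizes)."""

    def __init__(self, num_embodiments, prompt_len, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        # One independent set of learnable prompt embeddings per data source.
        self.prompts = nn.Parameter(torch.randn(num_embodiments, prompt_len, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Hypothetical flow-matching head: predicts a velocity field over action tokens.
        self.velocity_head = nn.Linear(d_model, d_model)

    def forward(self, tokens, embodiment_id):
        # tokens: (B, T, d_model); embodiment_id: (B,) long tensor of source indices
        prompt = self.prompts[embodiment_id]        # (B, P, d_model), picked per sample
        x = torch.cat([prompt, tokens], dim=1)      # prepend embodiment-specific prompts
        h = self.encoder(x)
        # Read out velocities only at the original (non-prompt) token positions.
        return self.velocity_head(h[:, prompt.shape[1]:])
```

Note that the backbone (`self.encoder`) is shared across all embodiments; only the small prompt table grows with the number of data sources, which is what keeps the added parameter count minimal.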