🤖 AI Summary
This work addresses the poor cross-platform and cross-dataset generalization of generic vision-language-action (VLA) models. To this end, we propose a scalable architecture based on soft prompts: for each category of heterogeneous robotic data (differing in platform, environment, or task), we introduce an independent, learnable soft-prompt embedding that conditions a standard Transformer encoder, and we replace task-specific decoders with a flow-matching mechanism for end-to-end action modeling. Crucially, our method integrates multi-source embodied data without modifying the backbone network, significantly enhancing adaptability to diverse robot morphologies and tasks. Experiments across six simulation environments and three real-world robotic platforms demonstrate that our 0.9B-parameter model outperforms prior approaches on multiple benchmarks, exhibiting strong generalization and rapid zero-shot and few-shot adaptation.
📝 Abstract
Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity in rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimally added parameters: we infuse prompt-learning concepts into cross-embodiment robot learning by introducing a separate set of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which together empower VLA models to effectively exploit varying cross-embodiment features. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders, enjoying both scalability and simplicity. Evaluated across 6 simulations as well as 3 real-world robots, our 0.9B instantiation, X-VLA-0.9B, simultaneously achieves SOTA performance over a sweep of benchmarks, demonstrating superior results on a wide range of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks. Website: https://thu-air-dream.github.io/X-VLA/
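The soft-prompt idea described above can be sketched in a few lines of PyTorch: a separate learnable prompt per data source is prepended to the token sequence of a shared Transformer encoder, and a head predicts a flow-matching velocity over the action positions. All module names, sizes, and the head design here are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class SoftPromptedEncoder(nn.Module):
    """Minimal sketch of per-embodiment soft prompting (hypothetical names/sizes)."""

    def __init__(self, num_embodiments, prompt_len, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        # One independent set of learnable prompt embeddings per data source.
        self.prompts = nn.Parameter(torch.randn(num_embodiments, prompt_len, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Hypothetical flow-matching head: predicts a velocity field over action tokens.
        self.velocity_head = nn.Linear(d_model, d_model)

    def forward(self, tokens, embodiment_id):
        # tokens: (B, T, d_model); embodiment_id: (B,) long tensor of source indices
        prompt = self.prompts[embodiment_id]        # (B, P, d_model), picked per sample
        x = torch.cat([prompt, tokens], dim=1)      # prepend embodiment-specific prompts
        h = self.encoder(x)
        # Read out velocities only at the original (non-prompt) token positions.
        return self.velocity_head(h[:, prompt.shape[1]:])
```

Note that the backbone (`self.encoder`) is shared across all embodiments; only the small prompt table grows with the number of data sources, which is what keeps the added parameter count minimal.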