MoS-VLA: A Vision-Language-Action Model with One-Shot Skill Adaptation

📅 2025-10-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language-action (VLA) models exhibit poor generalization across environments, tasks, and robotic platforms, hindering direct deployment in novel scenarios. To address this, we propose a lightweight, one-shot adaptation framework: robot policies are modeled as linear combinations of learnable basis functions, enabling gradient-free skill inference via L1-regularized convex optimization from a single demonstration. A hybrid skill architecture is jointly pre-trained on the Open X-Embodiment multi-source dataset to construct a structured, reusable skill space. Experiments demonstrate that our method achieves significantly lower action prediction error than state-of-the-art VLA models across five unseen benchmarks. Moreover, it successfully executes tasks in both simulation and real-world robotic settings, where baseline VLA models fail entirely. This work advances practical robot policy adaptation by combining structured representation learning with efficient, optimization-based few-shot inference.

📝 Abstract
Vision-Language-Action (VLA) models trained on large robot datasets promise general-purpose, robust control across diverse domains and embodiments. However, existing approaches often fail out-of-the-box when deployed in novel environments, embodiments, or tasks. We introduce Mixture of Skills VLA (MoS-VLA), a framework that represents robot manipulation policies as linear combinations of a finite set of learned basis functions. During pretraining, MoS-VLA jointly learns these basis functions across datasets from the Open X-Embodiment project, producing a structured skill space. At test time, adapting to a new task requires only a single expert demonstration. The corresponding skill representation is then inferred via a lightweight convex optimization problem that minimizes the L1 action error, without requiring gradient updates. This gradient-free adaptation incurs minimal overhead while enabling rapid instantiation of new skills. Empirically, MoS-VLA achieves lower action-prediction error on five out of five unseen datasets and succeeds in both simulation and real-robot tasks where a pretrained VLA model fails outright. Project page: mos-vla.github.io/
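The test-time adaptation described above (inferring a skill's mixture weights from one demonstration by minimizing the L1 action error) can be sketched as a small linear program. The sketch below is illustrative, not the paper's implementation: the function name `infer_skill_weights`, the matrix `B` of basis-function outputs stacked over the demonstration, and the regularization weight `lam` are all assumptions; the paper's exact objective and regularization details may differ.

```python
import numpy as np
from scipy.optimize import linprog

def infer_skill_weights(B, a, lam=1e-3):
    """One-shot skill inference as an L1-regularized convex program.

    Solves  min_w ||B w - a||_1 + lam * ||w||_1  as a linear program.
    B (m x k): basis-function action predictions on the demo states.
    a (m,):    the demonstrated (expert) actions.
    Returns the mixture weights w over the k learned basis skills.
    """
    m, k = B.shape
    # LP variables: [w (k, free), t (m, residual bounds), s (k, |w| bounds)]
    c = np.concatenate([np.zeros(k), np.ones(m), lam * np.ones(k)])
    I_m, I_k = np.eye(m), np.eye(k)
    Z_mk, Z_km = np.zeros((m, k)), np.zeros((k, m))
    A_ub = np.block([
        [ B,   -I_m,  Z_mk],   #   B w - a <= t
        [-B,   -I_m,  Z_mk],   # -(B w - a) <= t
        [ I_k,  Z_km, -I_k],   #   w <= s
        [-I_k,  Z_km, -I_k],   #  -w <= s
    ])
    b_ub = np.concatenate([a, -a, np.zeros(k), np.zeros(k)])
    bounds = [(None, None)] * k + [(0, None)] * (m + k)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:k]
```

Because the problem is a convex LP, no gradient updates to the pretrained model are needed, which matches the paper's claim of lightweight, gradient-free adaptation.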
Problem

Research questions and friction points this paper is trying to address.

Adapts robot policies to new tasks with one demonstration
Enables gradient-free skill adaptation via convex optimization
Improves action prediction across diverse unseen robot datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Skills VLA represents policies as linear combinations
Learns basis functions across datasets for structured skill space
Adapts to new tasks via single demonstration and convex optimization
🔎 Similar Papers
No similar papers found.