CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of improving performance while keeping adaptation cost low when fine-tuning pretrained vision-language-action (VLA) models with standard supervised fine-tuning (SFT). The authors propose a "capability vector" approach: the model is fine-tuned to convergence on a small task set with two distinct training strategies, and the parameter difference between the two resulting models is interpreted as a transferable capability vector that captures what the auxiliary objectives contribute. This vector is then merged with the pretrained parameters to form a capability-enhanced meta-model. When standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model reaches performance comparable to auxiliary-objective fine-tuning baselines at reduced computational overhead, and the capability vectors generalize across models, environments, and embodiments.

📝 Abstract

This paper addresses the problem that standard supervised finetuning (SFT) of pretrained VLA models often fails to both improve performance and reduce adaptation cost. Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of steps to convergence. However, they typically incur significant computational overhead due to the additional losses from the auxiliary objectives. To combine the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT in parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To achieve this, we only need to train the model to convergence on a small-scale task set using two distinct training strategies, resulting in two finetuned models. The parameter difference between the two models can then be interpreted as capability vectors provided by the auxiliary objectives. These vectors are then merged with the pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary-finetuned baselines with reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, and (2) generalize to novel environments and embodiments out of the box.
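The extract-and-merge step described in the abstract can be illustrated with a minimal sketch. It assumes the capability vector is the per-parameter difference "auxiliary-finetuned minus standard-SFT" and that merging is a plain addition to the pretrained weights; the function names, the `scale` factor, and the toy parameter dictionaries are hypothetical and not taken from the paper.

```python
import numpy as np

def capability_vector(theta_aux, theta_sft):
    """Per-parameter difference between two finetuned checkpoints.

    Assumption: theta_aux was trained with auxiliary objectives and
    theta_sft with standard SFT, both on the same small task set.
    """
    return {name: theta_aux[name] - theta_sft[name] for name in theta_aux}

def merge(theta_pre, vector, scale=1.0):
    """Add the capability vector to the pretrained parameters.

    `scale` is a hypothetical merge coefficient; 1.0 means plain addition.
    """
    return {name: theta_pre[name] + scale * vector[name] for name in theta_pre}

# Toy parameters standing in for real model checkpoints.
theta_pre = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
theta_sft = {"w": np.array([1.5, 2.0]), "b": np.array([0.25])}
theta_aux = {"w": np.array([2.0, 3.0]), "b": np.array([0.75])}

vec = capability_vector(theta_aux, theta_sft)
theta_meta = merge(theta_pre, vec)
print(theta_meta)
```

In this reading, `theta_meta` plays the role of the capability-enhanced meta model, which would then be adapted to new tasks with standard SFT.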

Problem

Research questions and friction points this paper is trying to address.

vision-language-action models

supervised fine-tuning

adaptation cost

performance improvement

computational overhead

Innovation

Methods, ideas, or system contributions that make the work stand out.

capability vectors

parameter space decoupling

vision-language-action models

efficient finetuning

transferable representation

