🤖 AI Summary
This work addresses the challenge of simultaneously improving performance and adaptation efficiency when fine-tuning pretrained vision–language–action (VLA) models under standard supervised settings. To this end, the authors propose a meta-learning framework that disentangles, in parameter space, the general-purpose capabilities conferred by auxiliary training objectives from the fitting of task-specific action distributions. The capability gain is captured as a parameter-difference vector and fused into the pretrained weights, yielding a capability-enhanced meta model. Standard supervised fine-tuning of this model, augmented with a lightweight orthogonal regularization loss, matches the performance of more complex auxiliary fine-tuning methods across diverse robotic manipulation tasks while substantially reducing computational overhead.
📝 Abstract
This paper proposes a novel approach to the problem that standard supervised finetuning (SFT) of pretrained VLA models often fails to simultaneously improve performance and reduce adaptation cost. Advanced finetuning methods with auxiliary training objectives can improve performance and accelerate convergence, but they typically incur significant computational overhead due to the additional auxiliary-task losses. To combine the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-task training within the parameter space: enhancing general capabilities and fitting task-specific action distributions. To achieve this, we only need to train the model to convergence on a small-scale task set under two distinct training strategies. The difference between the resulting model parameters can then be interpreted as a capability vector contributed by the auxiliary tasks. This vector is merged with the pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary-finetuned baselines at reduced computational overhead. Experimental results demonstrate that this approach is highly effective across diverse robot tasks. Project page: https://chris1220313648.github.io/Fast-dVLA/
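The parameter-space recipe in the abstract can be sketched concretely. The following is a minimal, hedged illustration (not the authors' implementation): `capability_vector` takes the difference between weights trained with auxiliary objectives and weights trained with plain SFT on the same small task set, `merge_into_pretrained` fuses that vector into the pretrained weights to form the meta model, and `orthogonal_penalty` is one plausible reading of the lightweight orthogonal regularizer, penalizing the squared cosine between the task-specific SFT update and the injected capability vector. The fusion coefficient `alpha` is a hypothetical knob not specified in the abstract; parameters are modeled as dicts of NumPy arrays standing in for a model state dict.

```python
import numpy as np

def capability_vector(theta_aux, theta_sft):
    """Difference between auxiliary-objective-trained and plain-SFT-trained
    weights, interpreted as the capability gained from auxiliary tasks."""
    return {k: theta_aux[k] - theta_sft[k] for k in theta_aux}

def merge_into_pretrained(theta_pre, cap_vec, alpha=1.0):
    """Fuse the capability vector into the pretrained weights to form the
    capability-enhanced meta model. `alpha` is a hypothetical scale."""
    return {k: theta_pre[k] + alpha * cap_vec[k] for k in theta_pre}

def orthogonal_penalty(task_update, cap_vec, eps=1e-12):
    """Squared cosine similarity between the task-specific update and the
    capability vector; driving it to zero keeps the SFT update orthogonal
    to the injected capabilities (an assumed form of the regularizer)."""
    dot = sum(float((task_update[k] * cap_vec[k]).sum()) for k in task_update)
    n_upd = np.sqrt(sum(float((task_update[k] ** 2).sum()) for k in task_update))
    n_cap = np.sqrt(sum(float((cap_vec[k] ** 2).sum()) for k in cap_vec))
    return (dot / (n_upd * n_cap + eps)) ** 2
```

For example, with toy one-layer "models", an SFT update that points along a direction orthogonal to the capability vector incurs zero penalty, while a parallel update is penalized maximally.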