🤖 AI Summary
This work addresses a limitation of vision-language-action (VLA) models in robotic control: full fine-tuning often overfits and causes catastrophic forgetting of pre-trained knowledge, while existing parameter-efficient fine-tuning (PEFT) methods adapt poorly to control tasks. The authors propose VLA-GSE (Generalized and Specialized Experts), a PEFT framework initialized by spectrally decomposing the frozen backbone: the leading singular components are assigned to shared generalized experts, and disjoint residual components are assigned to routed specialized experts. Updating only 2.51% of the full model parameters, VLA-GSE achieves an average zero-shot success rate of 81.2% on LIBERO-Plus, preserves pre-trained VLM capability comparably to LoRA on multimodal understanding benchmarks, and improves real-world manipulation success under multiple distribution shifts.
📝 Abstract
Vision-language-action (VLA) models inherit rich visual-semantic priors from pre-trained vision-language backbones, but adapting them to robotic control remains challenging. Full fine-tuning (FFT) is prone to overfitting on downstream robotic data and catastrophic forgetting of pre-trained vision-language capabilities. Parameter-efficient fine-tuning (PEFT) better preserves pre-trained knowledge, yet existing PEFT methods still struggle to adapt effectively to robot control tasks. To address this gap, we propose VLA-GSE, a parameter-efficient VLA fine-tuning framework that improves control adaptation while retaining PEFT's knowledge preservation advantage. Specifically, VLA-GSE (Generalized and Specialized Experts) is initialized by spectrally decomposing the frozen backbone, assigning leading singular components to generalized experts (shared experts) and disjoint residual components to specialized experts (routed experts). This decomposition improves adaptation capacity under a fixed trainable-parameter budget. Under a comparable parameter budget, VLA-GSE updates only 2.51% of the full model parameters and consistently outperforms strong FFT and PEFT baselines. It achieves 81.2% average zero-shot success on LIBERO-Plus, preserves pre-trained VLM capability comparably to LoRA on multimodal understanding benchmarks, and improves real-world manipulation success under multiple distribution shifts. Code is available at: https://github.com/YuhuaJiang2002/VLA-GSE
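The initialization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only assumes that "spectral decomposition" means a singular value decomposition of a single frozen weight matrix, and that each expert is stored as a LoRA-style pair of low-rank factors. The function name `spectral_expert_init`, its parameters, and the toy matrix size are all hypothetical, and the router that selects among specialized experts is omitted.

```python
import numpy as np


def spectral_expert_init(weight, shared_rank, routed_rank, num_routed):
    """Split the leading singular components of `weight` into expert factors.

    Hypothetical helper: returns one (B, A) pair for the shared expert and a
    list of (B, A) pairs for the routed experts, where B @ A is low rank.
    """
    # Singular values come back sorted in descending order.
    U, S, Vt = np.linalg.svd(weight, full_matrices=False)
    needed = shared_rank + routed_rank * num_routed
    if needed > S.shape[0]:
        raise ValueError("requested ranks exceed the number of singular values")

    def factors(idx):
        # Split each singular value evenly between the two factors.
        root = np.sqrt(S[idx])
        B = U[:, idx] * root              # shape (out_dim, r)
        A = root[:, None] * Vt[idx, :]    # shape (r, in_dim)
        return B, A

    # Leading components go to the shared (generalized) expert.
    shared = factors(np.arange(shared_rank))

    # The following components are cut into disjoint slices,
    # one slice per routed (specialized) expert.
    routed = []
    start = shared_rank
    for _ in range(num_routed):
        routed.append(factors(np.arange(start, start + routed_rank)))
        start += routed_rank
    return shared, routed


# Toy stand-in for one frozen backbone weight matrix (assumed size).
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 12))
shared, routed = spectral_expert_init(W, shared_rank=4, routed_rank=2, num_routed=3)
```

Because the singular vectors are orthonormal, the experts produced this way span mutually orthogonal subspaces, which is one way to read "disjoint" in the abstract. How the actual method sizes the experts, routes between them, and combines them with the frozen weights is not specified in this summary.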