🤖 AI Summary
This work addresses the limitations of existing Vision-Language-Action (VLA) models in cross-environment and cross-task fine-tuning, where fixed low-rank adaptation methods such as LoRA fail to accommodate the high and dynamically varying intrinsic rank requirements of robotics transfer. To overcome this, the authors propose LoRA-SP, the first energy-driven dynamic rank selection framework for VLA fine-tuning. LoRA-SP employs an input- and layer-adaptive rank allocation mechanism that automatically identifies critical adaptation directions via a spectral energy threshold. By combining an SVD-style parameterization, nonnegative routing scores, a shared vector bank, and energy-based pruning, the method matches or exceeds full fine-tuning on real-world multi-task robotic manipulation with far fewer trainable parameters, improves task success rates by up to 31.6% over standard LoRA, and significantly enhances generalization and robustness to perturbations.
📝 Abstract
Vision-language-action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, yet its exposed capacity knob, the rank, does not transfer uniformly: robotics transfer exhibits a higher, task-varying intrinsic rank compared with language fine-tuning. Small ranks suffice for LLMs (e.g., $r \in \{4, 8\}$), while spectral analyses indicate VLAs may require much larger ranks (e.g., $r \approx 128$) or near-full rank, a mismatch that worsens in multi-task settings. We present LoRA-SP (Select-Prune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise adaptive capacity. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores, $E(k) \ge \eta$, providing a direct link to approximation error via our spectral analysis. During training, the target $\eta$ concentrates energy on a few directions and teaches the router to rely on fewer vectors while preserving accuracy. This yields compact adapters that reduce cross-task interference and improve generalization. On four real-robot manipulation tasks collected on an unseen AgileX PiPER arm, across two VLA backbones ($\pi_0$ and SmolVLA), LoRA-SP matches or exceeds full fine-tuning with far fewer trainable parameters, and improves multi-task success by up to 31.6% over standard LoRA while remaining robust to rank choice.
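The core selection rule described above can be sketched concretely. Below is a minimal NumPy illustration of an SVD-style update $\Delta W = B\,\mathrm{diag}(s)\,A$ whose nonnegative router scores $s$ act as singular values, with the active rank $k$ chosen as the smallest value whose cumulative squared-score energy $E(k)$ reaches the target $\eta$. All names (`B`, `A`, `s`, `select` logic) are illustrative assumptions; the paper's actual router, vector bank, and training procedure are more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r_max, eta = 16, 16, 8, 0.90

# SVD-style parameterization: delta_W = B @ diag(s) @ A, where the
# nonnegative routing scores s play the role of singular values over
# a shared bank of r_max direction vectors (toy stand-ins here).
B = rng.normal(size=(d_out, r_max))
A = rng.normal(size=(r_max, d_in))
s = np.abs(rng.normal(size=r_max))  # nonnegative router scores

# Energy-based selection: sort scores, take the smallest k with
# E(k) = sum_{i<=k} s_i^2 / sum_i s_i^2 >= eta.
order = np.argsort(s)[::-1]
energy = np.cumsum(s[order] ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(energy, eta) + 1)
active = order[:k]

# Pruned low-rank update keeps only the selected directions,
# yielding a compact adapter of rank k <= r_max.
delta_W = B[:, active] @ np.diag(s[active]) @ A[active, :]
```

In this sketch, directions with small scores contribute negligible energy and are pruned, which is how a tighter $\eta$ would concentrate capacity on fewer vectors.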