Adaptive Capacity Allocation for Vision Language Action Fine-tuning

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing Vision-Language-Action (VLA) models in cross-environment and cross-task fine-tuning, where fixed low-rank adaptation methods such as LoRA fail to accommodate the high and dynamically varying intrinsic rank requirements of robotics transfer. To overcome this, the authors propose LoRA-SP, the first energy-driven dynamic rank selection framework for VLA fine-tuning. LoRA-SP employs an input- and layer-adaptive rank allocation mechanism that automatically identifies critical adaptation directions based on a spectral energy threshold. By integrating an SVD-style parameterization, non-negative routing scores, a shared vector bank, and energy-based pruning, the method matches or exceeds full fine-tuning in real-world robotic multi-task manipulation with far fewer trainable parameters, and improves task success rates by up to 31.6% over standard LoRA while significantly enhancing generalization and robustness to perturbations.

📝 Abstract
Vision language action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA model to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, yet the exposed capacity knob, the rank, does not transfer uniformly: robotics transfer exhibits a higher and task-varying intrinsic rank than language fine-tuning. Small ranks suffice for LLMs (e.g., $r \in \{4, 8\}$), while spectral analyses indicate VLAs may require much larger ranks (e.g., $r \approx 128$) or near-full rank, a mismatch that worsens in multi-task settings. We present LoRA-SP (Select-Prune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores $E(k) \ge \eta$, providing a direct link to approximation error via our spectral analysis. During training, $\eta$ concentrates energy on a few directions and teaches the router to rely on fewer vectors while preserving accuracy. This yields compact adapters that reduce cross-task interference and improve generalization. On four real-robot manipulation tasks collected on an unseen AgileX PiPER arm, across two VLA backbones ($\pi_0$ and SmolVLA), LoRA-SP matches or exceeds full fine-tuning with far fewer trainable parameters, and improves multi-task success by up to 31.6% over standard LoRA while remaining robust to rank choice.
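The abstract's selection rule, choosing the smallest active set whose cumulative squared routing scores reach an energy target $E(k) \ge \eta$, can be sketched as follows. This is a minimal illustration of the thresholding step only, not the paper's implementation; the function name and toy scores are assumptions for demonstration.

```python
import numpy as np

def select_active_rank(scores, eta=0.95):
    """Hypothetical sketch: pick the smallest k whose top-k squared
    routing scores capture at least a fraction eta of total energy."""
    s2 = np.sort(np.asarray(scores, dtype=float) ** 2)[::-1]  # squared scores, descending
    energy = np.cumsum(s2) / s2.sum()                         # normalized cumulative energy E(k)
    return int(np.searchsorted(energy, eta) + 1)              # smallest k with E(k) >= eta

# Concentrated scores yield a small active set; flat scores yield a large one.
concentrated = select_active_rank([3.0, 0.2, 0.1, 0.05])  # energy dominated by one direction
flat = select_active_rank([1.0, 1.0, 1.0, 1.0])           # energy spread evenly
```

In this toy setting the concentrated scores select a single direction while the flat scores keep all four, matching the abstract's intuition that the energy target concentrates capacity on a few directions when the router can afford to.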
Problem

Research questions and friction points this paper is trying to address.

Vision Language Action Models
Parameter-Efficient Fine-Tuning
Rank Adaptation
Multi-task Learning
Robotic Transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA-SP
rank-adaptive fine-tuning
vision-language-action models
parameter-efficient adaptation
spectral energy pruning
Donghoon Kim
Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Republic of Korea
Minji Bae
Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Republic of Korea
Unghui Nam
Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Republic of Korea
Gyeonghun Kim
Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Republic of Korea
Suyun Lee
Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Republic of Korea
Kyuhong Shim
Sungkyunkwan University
Deep Learning, Speech Processing, Language Processing
Byonghyo Shim
Professor, Department of Electrical and Computer Engineering, Seoul National University
Wireless Communications, Deep Learning, Information Theory, Statistical Signal Processing