Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address poor performance, low inference efficiency, and inflexible input-output interfaces when adapting vision-language-action (VLA) models to novel robotic platforms, this paper proposes Optimized Fine-Tuning (OFT), a fine-tuning recipe that jointly integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression objective. Applied to the OpenVLA architecture, OFT is evaluated in both simulated (LIBERO) and real-world (bimanual ALOHA robot) settings. Experiments show that OFT boosts the average task success rate on LIBERO from 76.5% to 97.1% while increasing action generation throughput by 26×. On the ALOHA platform, the resulting OpenVLA-OFT policy outperforms other fine-tuned VLAs and end-to-end imitation learning methods trained from scratch by up to 15 percentage points in average success rate. Overall, OFT improves policy performance, inference efficiency, and flexibility in the model's input-output specification.
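The core of the recipe is replacing autoregressive, discretized action-token generation with a regression head that emits a whole chunk of continuous actions in one forward pass, trained with an L1 loss. The sketch below illustrates that idea in miniature; the dimensions, the linear stand-in for the MLP action head, and all variable names are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical dimensions -- illustrative only, not the paper's config.
HIDDEN_DIM = 64   # size of the backbone's output embedding per chunk slot
CHUNK_LEN = 8     # number of future timesteps predicted at once
ACTION_DIM = 7    # e.g. 6-DoF end-effector delta + gripper command

rng = np.random.default_rng(0)

# Minimal stand-in for the action head that maps backbone hidden states
# (one per chunk slot, produced in a single parallel pass) to continuous actions.
W = rng.standard_normal((HIDDEN_DIM, ACTION_DIM)) * 0.01
b = np.zeros(ACTION_DIM)

def predict_chunk(hidden_states: np.ndarray) -> np.ndarray:
    """Parallel decoding: all CHUNK_LEN actions come from one forward pass,
    not from token-by-token autoregressive generation."""
    return hidden_states @ W + b  # (CHUNK_LEN, ACTION_DIM), continuous values

def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Simple L1 regression objective, used in place of next-token
    cross-entropy over discretized action bins."""
    return float(np.abs(pred - target).mean())

hidden = rng.standard_normal((CHUNK_LEN, HIDDEN_DIM))
target = rng.uniform(-1.0, 1.0, size=(CHUNK_LEN, ACTION_DIM))

pred = predict_chunk(hidden)
loss = l1_loss(pred, target)
print(pred.shape)  # one chunk of continuous actions per query
```

Because every chunk slot is decoded in the same forward pass, inference cost no longer scales with the number of action tokens, which is where the reported throughput gain comes from.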

📝 Abstract
Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model's input-output specifications. We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26×. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and outperform other VLAs (π₀ and RDT-1B) fine-tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate. We release code for OFT and pretrained model checkpoints at https://openvla-oft.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Optimizing fine-tuning strategies for Vision-Language-Action models
Enhancing model performance on novel robot setups
Improving inference efficiency and task success rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimized Fine-Tuning with parallel decoding
Action chunking for efficient policy execution
Continuous action representation for flexible outputs
Simple L1 regression objective in place of discretized action tokens
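The action-chunking idea above can be sketched as a control loop that executes a full chunk per policy query instead of one action at a time, so the number of expensive VLA forward passes drops by roughly the chunk length. The function and callback names below are hypothetical, not the paper's API.

```python
def run_chunked_control(policy, get_obs, send_action, total_steps):
    """Execute one chunk of actions per policy query.

    policy(obs) returns a list of continuous actions (the chunk);
    get_obs() and send_action(a) are hypothetical robot-interface callbacks.
    Returns the number of policy queries issued.
    """
    queries = 0
    step = 0
    while step < total_steps:
        chunk = policy(get_obs())  # one VLA forward pass -> a whole chunk
        queries += 1
        # Execute the chunk open-loop, truncating at the episode end.
        for action in chunk[: total_steps - step]:
            send_action(action)
            step += 1
    return queries
```

For example, with an 8-step chunk, a 26-step episode needs only 4 policy queries instead of 26, which is the main source of the control-frequency gains reported on the high-frequency ALOHA tasks.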