EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems: pretrained vision-language-action (VLA) policies generalize across manipulation tasks but fall short of the reliability that real-world robotic deployment requires, and existing reinforcement learning (RL) fine-tuning approaches lack the sample efficiency needed to close that gap. The authors propose EXPO-FT, a system for stable, sample-efficient RL fine-tuning that builds on the pretrained VLA policy as a prior and refines the policy update mechanism to improve training stability and data efficiency. With an average of 19.1 minutes of online robot data, EXPO-FT reaches 30/30 successes on every evaluated manipulation task, outperforming both prior RL-from-scratch and VLA fine-tuning methods.

📝 Abstract

The ability to efficiently and reliably learn new tasks has been a foundational challenge in robotics. Vision-Language-Action (VLA) models have demonstrated strong generalization across diverse manipulation tasks, yet pretrained policies consistently fall short of the reliability required for real-world deployment. Reinforcement learning (RL) fine-tuning offers a promising path to bridge this gap, but existing approaches either train from scratch without fully leveraging pretrained priors, or fine-tune VLAs without achieving the sample efficiency and success rates that practical deployment demands. We present EXPO-FT, a system for stable, sample-efficient RL finetuning of pretrained VLA policies that closes this gap. Our system solves a suite of challenging manipulation tasks, including routing string lights and inserting the plug to light it up, striking a pool ball into a pocket, and inserting a flower into a wine bottle, each requiring combinations of high precision, dynamic actions, and robustness to varied initial states. Our system achieves perfect task performance (30/30 successes) across all evaluated tasks within an average of 19.1 minutes of online robot data, outperforming both prior RL-from-scratch and VLA finetuning approaches. We release an open-source codebase with the aim of facilitating broader adoption of RL finetuning of VLA models in robotics.

Problem

Research questions and friction points this paper is trying to address.

vision-language-action models

reinforcement learning fine-tuning

sample efficiency

robotic manipulation

policy reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

sample-efficient reinforcement learning

vision-language-action models

RL fine-tuning