🤖 AI Summary
To address the low sample efficiency, poor compatibility with action chunking, and training instability that hinder reinforcement-learning fine-tuning of Vision-Language-Action (VLA) models in real-world robot control, this paper proposes Chunked RL, an offline reinforcement learning framework that extends temporal-difference learning to explicit action chunks. Building on this framework, the authors introduce CO-RFT, an algorithm that fine-tunes VLA models from only 30–60 demonstration trajectories: the backbone and policy are first initialized via full-parameter imitation learning, then optimized with offline RL over action chunks. On physical robots, CO-RFT improves task success rate by 57% over supervised fine-tuning baselines and reduces cycle time by 22.3%. Moreover, the learned policy achieves a 44.3% success rate at previously unseen object positions, indicating improved deployment efficiency and cross-scenario generalization.
📝 Abstract
Vision-Language-Action (VLA) models demonstrate significant potential for developing generalized policies in real-world robotic control. This progress inspires researchers to explore fine-tuning these models with Reinforcement Learning (RL). However, fine-tuning VLA models with RL still faces challenges related to sample efficiency, compatibility with action chunking, and training stability. To address these challenges, we explore the fine-tuning of VLA models through offline reinforcement learning incorporating action chunking. In this work, we propose Chunked RL, a novel reinforcement learning framework specifically designed for VLA models. Within this framework, we extend temporal difference (TD) learning to incorporate action chunking, a prominent characteristic of VLA models. Building upon this framework, we propose CO-RFT, an algorithm aimed at fine-tuning VLA models using a limited set of demonstrations (30 to 60 samples). Specifically, we first conduct imitation learning (IL) with full parameter fine-tuning to initialize both the backbone and the policy. Subsequently, we implement offline RL with action chunking to optimize the pretrained policy. Our empirical results in real-world environments demonstrate that CO-RFT outperforms previous supervised methods, achieving a 57% improvement in success rate and a 22.3% reduction in cycle time. Moreover, our method exhibits robust positional generalization capabilities, attaining a success rate of 44.3% in previously unseen positions.
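The abstract's central idea, extending temporal-difference learning to action chunking, can be read as computing an n-step-style bootstrapped target over the whole chunk rather than a single action. The sketch below illustrates that reading under assumed notation; the function name, reward layout, and the exact form of the bootstrap are hypothetical simplifications, not the paper's implementation:

```python
import numpy as np

def chunked_td_target(rewards, next_q, gamma=0.99):
    """Illustrative n-step-style TD target over one action chunk.

    rewards: the h per-step rewards collected while executing a chunk
             of h actions (hypothetical layout, not from the paper).
    next_q:  a critic's value estimate for the state reached after the
             chunk finishes (bootstrapping term).
    Returns the discounted reward sum plus the gamma^h-discounted
    bootstrap, i.e. sum_i gamma^i * r_i + gamma^h * next_q.
    """
    h = len(rewards)
    discounts = gamma ** np.arange(h)  # 1, gamma, gamma^2, ...
    return float(np.dot(discounts, rewards) + gamma**h * next_q)
```

For a chunk of length h = 3 with rewards [1.0, 0.0, 0.5] and a bootstrap value of 2.0, this yields 1.0 + 0.99^2 * 0.5 + 0.99^3 * 2.0 ≈ 3.43; a single-step critic would instead bootstrap after every action, which is what makes naive TD learning a poor fit for chunked policies.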