RL Token: Bootstrapping Online RL with Vision-Language-Action Models

📅 2026-04-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Pretrained vision-language-action (VLA) models often lack the precision and speed that real-world robotic tasks demand. This work proposes a lightweight online reinforcement learning (RL) fine-tuning method built around an “RL token,” a compact readout representation that preserves task-relevant pretrained knowledge and serves as an efficient interface between the VLA and online RL. A small actor-critic head trained on this token refines the VLA's actions, while the learned policy stays anchored to the VLA. On four real-robot tasks (screw installation, zip tie fastening, charger insertion, and Ethernet insertion), minutes to a few hours of real-world practice raise success rates significantly and speed up the hardest part of the task by up to 3×. On some tasks the resulting policy is faster than human teleoperation.

📝 Abstract

Vision-language-action (VLA) models can learn to perform diverse manipulation skills "out of the box," but achieving the precision and speed that real-world tasks demand requires further fine-tuning -- for example, via reinforcement learning (RL). We introduce a lightweight method that enables sample-efficient online RL fine-tuning of pretrained VLAs using just a few hours of real-world practice. We (1) adapt the VLA to expose an "RL token," a compact readout representation that preserves task-relevant pretrained knowledge while serving as an efficient interface for online RL, and (2) train a small actor-critic head on this RL token to refine the actions, while anchoring the learned policy to the VLA. Online RL with the RL token (RLT) makes it possible to fine-tune even large VLAs with RL quickly and efficiently. Across four real-robot tasks (screw installation, zip tie fastening, charger insertion, and Ethernet insertion), RLT improves the speed on the hardest part of the task by up to 3x and raises success rates significantly within minutes to a few hours of practice. It can even surpass the speed of human teleoperation on some of the tasks.
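To make the two ingredients in the abstract concrete, here is a minimal sketch of a small actor-critic head operating on an RL token, with an anchoring penalty toward the VLA's action. It is an illustration under stated assumptions, not the paper's implementation: the dimensions, the residual form of the actor, the squared-distance anchor, and every name (`init_head`, `actor`, `critic`, `actor_loss`, `anchor_weight`) are hypothetical. The RL token and the VLA action are stand-in random arrays here, and no gradient update is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the abstract does not specify them.
TOKEN_DIM, ACTION_DIM, HIDDEN = 32, 7, 64


def init_head(in_dim, out_dim):
    """Parameters of a two-layer MLP with small random weights."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, HIDDEN)),
        "b1": np.zeros(HIDDEN),
        "w2": rng.normal(0.0, 0.1, (HIDDEN, out_dim)),
        "b2": np.zeros(out_dim),
    }


def mlp(params, x):
    """Forward pass: tanh hidden layer followed by a linear output layer."""
    h = np.tanh(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]


def actor(params, rl_token, vla_action):
    """Assumed residual actor: VLA action plus a learned correction."""
    inputs = np.concatenate([rl_token, vla_action], axis=-1)
    return vla_action + mlp(params, inputs)


def critic(params, rl_token, action):
    """Scalar Q-value estimate for each (RL token, action) pair."""
    inputs = np.concatenate([rl_token, action], axis=-1)
    return mlp(params, inputs)[..., 0]


def actor_loss(actor_params, critic_params, rl_token, vla_action, anchor_weight=1.0):
    """Assumed objective: maximize Q while penalizing drift from the VLA action."""
    action = actor(actor_params, rl_token, vla_action)
    q = critic(critic_params, rl_token, action)
    anchor = np.sum((action - vla_action) ** 2, axis=-1)
    return np.mean(-q + anchor_weight * anchor)


# Stand-ins for a batch of RL tokens and the frozen VLA's proposed actions.
actor_params = init_head(TOKEN_DIM + ACTION_DIM, ACTION_DIM)
critic_params = init_head(TOKEN_DIM + ACTION_DIM, 1)
rl_token = rng.normal(size=(4, TOKEN_DIM))
vla_action = rng.normal(size=(4, ACTION_DIM))
refined = actor(actor_params, rl_token, vla_action)
loss = actor_loss(actor_params, critic_params, rl_token, vla_action)
```

In this sketch only the two small heads would be trained, which is one way to read the abstract's claim that even large VLAs can be fine-tuned quickly. A larger `anchor_weight` keeps the refined action closer to the VLA's proposal.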

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning

vision-language-action models

online fine-tuning

sample efficiency

real-world robotics

Innovation

Methods, ideas, or system contributions that make the work stand out.

RL token

vision-language-action models

online reinforcement learning