Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

📅 2026-05-01

🤖 AI Summary
Offline pretraining alone cannot cover the distribution shifts, task variations, and human interventions that robots encounter during real-world deployment, which limits the robustness of generalist robot policies. This work proposes Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework that continually post-trains a vision-language-action (VLA) generalist policy on experience pooled across a robot fleet. The approach combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. Experiments on a fleet of 16 dual-arm robots across eight real-world manipulation tasks show that a single generalist policy improves as fleet experience accumulates, reaching an average task success rate of 95%, with the largest gains on long-horizon tasks.
📝 Abstract
Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3-5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.
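The deploy, pool, improve, redeploy loop described in the abstract can be pictured with a small toy simulation. This is only a schematic sketch under stated assumptions: the `Episode` record, the `rollout`, `improve`, and `run_fleet` functions, the task names, and the scalar `skill` are all hypothetical stand-ins, and the update rule is a placeholder rather than the paper's DIVL or QAM.

```python
import random
from dataclasses import dataclass


@dataclass
class Episode:
    robot_id: int
    task: str
    success: bool           # sparse reward: known only at episode end
    human_intervened: bool  # an operator corrected the robot mid-episode


def rollout(robot_id, task, skill, rng):
    """Toy stand-in for one real-world episode (hypothetical)."""
    success = rng.random() < skill
    # Assumption: operators step in on some of the failing episodes.
    intervened = (not success) and rng.random() < 0.5
    return Episode(robot_id, task, success, intervened)


def improve(skill, buffer):
    """Placeholder for the RL update; NOT the paper's DIVL/QAM step.

    It only nudges a scalar "skill" using episodes that carry a learning
    signal (failures and human corrections).
    """
    informative = sum(1 for ep in buffer if ep.human_intervened or not ep.success)
    return min(1.0, skill + 0.001 * informative)


def run_fleet(num_robots=16, rounds=5, seed=0):
    rng = random.Random(seed)
    tasks = ["restock", "long_horizon"]  # hypothetical task names
    skill = 0.5                          # stands in for the pretrained policy
    buffer = []                          # experience shared by the whole fleet
    history = [skill]
    for _ in range(rounds):
        # 1. Deploy: every robot runs the same current policy.
        for robot_id in range(num_robots):
            buffer.append(rollout(robot_id, rng.choice(tasks), skill, rng))
        # 2. Improve on the pooled data; 3. redeploy in the next round.
        skill = improve(skill, buffer)
        history.append(skill)
    return buffer, history


buffer, history = run_fleet()
print(len(buffer), [round(s, 3) for s in history])
```

The point of the sketch is the data flow: one policy is shared by all robots, every robot writes into one buffer, and each update sees the pooled autonomous rollouts and human interventions. The numbers it prints are artifacts of the toy update rule and say nothing about the reported results.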
Problem

Research questions and friction points this paper is trying to address.

distribution shift
long-tail failures
task variations
human corrections
real-world deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

fleet-scale reinforcement learning
Vision-Language-Action (VLA)
Distributional Implicit Value Learning (DIVL)
Q-learning via Adjoint Matching (QAM)
continual post-training
Authors

Yi Wang
Shanghai AI Laboratory
Computer Vision, Pattern Recognition

Xinchen Li
AGIBOT Finch

Pengwei Xie
AGIBOT Finch

Pu Yang
AGIBOT Finch

Buqing Nie
Shanghai Jiao Tong University
Reinforcement Learning, Robot Learning

Yunuo Cai
Shanghai Innovation Institute; AGIBOT Finch

Qinglin Zhang
AGIBOT Finch

Chendi Qu
Shanghai Jiao Tong University
Optimal Control, Robotics

Jeffrey Wu
Anthropic AI, OpenAI

Jianheng Song
AGIBOT Finch

Xinlin Ren
AGIBOT Finch

Jingshun Huang
Shanghai Innovation Institute; AGIBOT Finch

Mingjie Pan
Peking University

Siyuan Feng
AGIBOT Finch

Zhi Chen
AGIBOT Finch

Jianlan Luo
UC Berkeley, Google X
Robotics, Machine Learning, Artificial Intelligence