SOP: A Scalable Online Post-Training System for Vision-Language-Action Models

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 4 · Influential: 0
🤖 AI Summary
This work proposes SOP, the first online, multi-robot, multi-task post-training framework for general-purpose vision-language-action (VLA) models. Existing VLA post-training methods are typically offline, single-robot, or task-specific, limiting their capacity for efficient online adaptation and large-scale real-world learning. SOP addresses this gap through a closed-loop streaming architecture that tightly couples a fleet of robots with a cloud-based learner. The system integrates interactive imitation learning (HG-DAgger) and reinforcement learning (RECAP), enabling asynchronous policy updates and human-in-the-loop interventions. Evaluated on real-world tasks such as cloth folding and box assembly, SOP significantly improves pretrained model performance within hours, with gains scaling nearly linearly with the number of robots while preserving the generality of a single shared policy.
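The interactive imitation learning component mentioned above (HG-DAgger) can be pictured as a data-collection loop in which the policy acts until a human supervisor intervenes, and only the intervened steps are labeled with the human's corrective actions and aggregated into the training set. A minimal sketch, assuming a toy environment and illustrative function names (not the paper's actual code):

```python
# Illustrative HG-DAgger-style collection loop. All names and the toy
# environment are assumptions for exposition, not the paper's implementation.

def collect_episode(policy, human, observations):
    """Run the policy; record (obs, human_action) only where the human intervenes."""
    dataset = []
    for obs in observations:
        human_action = human(obs)        # None unless the supervisor takes over
        if human_action is not None:
            dataset.append((obs, human_action))  # keep only expert corrections
            action = human_action                # human controls this step
        else:
            action = policy(obs)                 # policy remains in control
    return dataset

# Toy usage: the "human" corrects whenever the observation is negative.
policy = lambda obs: 0
human = lambda obs: -obs if obs < 0 else None
data = collect_episode(policy, human, [1, -2, 3, -4])
print(data)  # [(-2, 2), (-4, 4)]
```

The key property this sketch captures is that the dataset grows with on-policy states visited under the learner's own behavior, labeled by the expert only where the policy was judged to be failing.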

📝 Abstract
Vision-language-action (VLA) models achieve strong generalization through large-scale pre-training, but real-world deployment requires expert-level task proficiency in addition to broad generality. Existing post-training approaches for VLA models are typically offline, single-robot, or task-specific, limiting effective on-policy adaptation and scalable learning from real-world interaction. We introduce a Scalable Online Post-training (SOP) system that enables online, distributed, multi-task post-training of generalist VLA models directly in the physical world. SOP tightly couples execution and learning through a closed-loop architecture in which a fleet of robots continuously streams on-policy experience and human intervention signals to a centralized cloud learner, and asynchronously receives updated policies. This design supports prompt on-policy correction, scales experience collection through parallel deployment, and preserves generality during adaptation. SOP is agnostic to the choice of post-training algorithm; we instantiate it with both interactive imitation learning (HG-DAgger) and reinforcement learning (RECAP). Across a range of real-world manipulation tasks including cloth folding, box assembly, and grocery restocking, we show that SOP substantially improves the performance of large pretrained VLA models while maintaining a single shared policy across tasks. Effective post-training can be achieved within hours of real-world interaction, and performance scales near-linearly with the number of robots in the fleet. These results suggest that tightly coupling online learning with fleet-scale deployment is instrumental to enabling efficient, reliable, and scalable post-training of generalist robot policies in the physical world.
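The closed-loop architecture described in the abstract, in which robots continuously stream on-policy experience to a centralized learner and asynchronously pull updated policies, can be sketched as a producer-consumer loop. This is a simplified single-process simulation under assumed names; the real system would replace the queue with network streaming and the counter with actual gradient updates:

```python
import queue
import threading

# Hypothetical sketch of SOP's fleet-to-learner loop: robot workers stream
# experience into a shared buffer; the learner consumes batches and publishes
# new policy versions that workers pick up asynchronously.

experience_buffer = queue.Ueue() if False else queue.Queue()  # robot -> learner stream
policy_version = {"weights": 0}    # shared policy state, written by the learner
lock = threading.Lock()

def robot_worker(robot_id, n_steps):
    """Collect on-policy experience with the current policy and stream it."""
    for step in range(n_steps):
        with lock:
            version = policy_version["weights"]   # asynchronous policy pull
        transition = (robot_id, step, version)    # stand-in for (obs, action, reward)
        experience_buffer.put(transition)

def learner(n_updates, batch_size):
    """Consume streamed experience and publish updated policy versions."""
    for _ in range(n_updates):
        batch = [experience_buffer.get() for _ in range(batch_size)]
        with lock:
            policy_version["weights"] += 1        # stand-in for a gradient step

robots = [threading.Thread(target=robot_worker, args=(i, 20)) for i in range(4)]
learner_thread = threading.Thread(target=learner, args=(8, 10))
for t in robots + [learner_thread]:
    t.start()
for t in robots + [learner_thread]:
    t.join()

print(policy_version["weights"])  # 8 learner updates from 4 x 20 streamed transitions
```

Because experience collection is parallel across workers while the learner is shared, adding robots increases the rate at which the buffer fills without changing the policy interface, which is the mechanism behind the near-linear scaling the abstract reports.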
Problem

Research questions and friction points this paper is trying to address.

vision-language-action models
online post-training
real-world robot learning
scalable adaptation
on-policy learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

online post-training
vision-language-action models
fleet-scale robot learning
on-policy adaptation
closed-loop learning system
👥 Authors
Mingjie Pan
Peking University
Siyuan Feng
Agibot Research
Qinglin Zhang
Agibot Research
Xinchen Li
Agibot Research
Jianheng Song
Agibot Research
Chendi Qu
Shanghai Jiao Tong University
Optimal Control, Robotics
Yi Wang
Shanghai AI Laboratory
Computer Vision, Pattern Recognition
Chuankang Li
Agibot Research
Ziyu Xiong
Agibot Research
Zhi Chen
Agibot Research
Yi Liu
Agibot Research
Jianlan Luo
UC Berkeley, Google X
Robotics, Machine Learning, Artificial Intelligence