AtomVLA: Scalable Post-Training for Robotic Manipulation via Predictive Latent World Models

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses error accumulation in long-horizon tasks, a challenge existing vision-language-action (VLA) models face due to the absence of intermediate guidance. The authors propose the first subtask-aware VLA framework, which leverages a large language model to decompose high-level instructions into atomic subtasks and employs a pretrained predictive latent world model to evaluate, in latent space, how well action segments align with subtask goals. By integrating offline post-training with Group Relative Policy Optimization, the method suppresses error propagation without requiring costly online interaction. The approach achieves state-of-the-art average success rates of 97.0% on LIBERO and 48.0% on the more challenging LIBERO-PRO benchmark, and demonstrates strong generalization to complex long-horizon tasks on the Galaxea R1 Lite real-world robot platform.
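The core mechanism described in the summary, rolling candidate action chunks through a predictive latent world model and scoring them against a subtask goal embedding, can be sketched roughly as follows. This is not the paper's code: the `LatentWorldModel` class, the GRU dynamics, the cosine-similarity reward, and all tensor shapes are illustrative assumptions.

```python
# Illustrative sketch (not the authors' implementation): score candidate
# action chunks by how close their predicted latent outcome is to a subtask goal.
import torch
import torch.nn as nn


class LatentWorldModel(nn.Module):
    """Hypothetical predictive latent world model: given the current latent
    state and an action chunk, predict the latent state after execution."""

    def __init__(self, latent_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.dynamics = nn.GRU(input_size=action_dim, hidden_size=latent_dim, batch_first=True)

    def forward(self, z_t: torch.Tensor, action_chunk: torch.Tensor) -> torch.Tensor:
        # z_t: (B, latent_dim), action_chunk: (B, chunk_len, action_dim)
        _, z_next = self.dynamics(action_chunk, z_t.unsqueeze(0))
        return z_next.squeeze(0)  # predicted latent state after executing the chunk


def score_action_chunks(world_model, z_t, action_chunks, z_goal):
    """Reward each candidate chunk by the similarity between its predicted
    latent outcome and the subtask goal embedding (cosine is an assumption)."""
    z_pred = world_model(z_t, action_chunks)                 # (K, latent_dim)
    return torch.cosine_similarity(z_pred, z_goal, dim=-1)   # (K,)


if __name__ == "__main__":
    K, latent_dim, chunk_len, action_dim = 4, 256, 8, 7
    wm = LatentWorldModel(latent_dim, action_dim)
    z_t = torch.randn(K, latent_dim)                 # current latent state, repeated per candidate
    z_goal = torch.randn(1, latent_dim)              # embedding of the decomposed atomic subtask
    chunks = torch.randn(K, chunk_len, action_dim)   # K candidate action chunks from the VLA policy
    print(score_action_chunks(wm, z_t, chunks, z_goal))
```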

📝 Abstract
Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. Robust instruction grounding is a critical component of effective control and can improve the execution of complex multi-step behaviors in VLA models. However, current paradigms predominantly rely on coarse, high-level task instructions during supervised fine-tuning. This instruction grounding gap leaves models without explicit intermediate guidance, leading to severe compounding errors in long-horizon tasks. Bridging this instruction gap and providing scalable post-training for VLA models is therefore urgent. To tackle this problem, we propose AtomVLA, the first subtask-aware VLA framework integrated with a scalable offline post-training pipeline. Our framework leverages a large language model to decompose high-level demonstrations into fine-grained atomic subtasks. It then uses a pretrained predictive world model to score candidate action chunks against subtask goals in the latent space, mitigating error accumulation and significantly improving long-horizon robustness. Furthermore, this approach enables highly efficient Group Relative Policy Optimization without the prohibitive cost of online rollouts on physical robots. Extensive simulations validate that AtomVLA maintains strong robustness under perturbations. When evaluated against baseline models, it achieves an average success rate of 97.0% on the LIBERO benchmark and 48.0% on the LIBERO-PRO benchmark. Finally, real-world experiments on the Galaxea R1 Lite platform confirm its broad applicability across diverse tasks, especially long-horizon tasks. All datasets, checkpoints, and code will be publicly released upon acceptance of this work to support future research.
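The abstract does not spell out the post-training objective, so the following is only a rough sketch of how group-relative advantages, in the spirit of Group Relative Policy Optimization, could be computed from world-model rewards over groups of candidate action chunks. The per-group normalization, the clipped surrogate, and all shapes are assumptions, not the authors' implementation.

```python
# Illustrative GRPO-style offline update sketch (assumptions, not the paper's objective):
# rewards are the world model's chunk scores; advantages are normalized within
# each group of candidate chunks sampled for the same subtask.
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G, K) scores for K candidate chunks in each of G groups."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)


def clipped_policy_loss(log_probs, old_log_probs, advantages, clip_eps: float = 0.2):
    """PPO-style clipped surrogate applied with group-relative advantages."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


if __name__ == "__main__":
    G, K = 2, 4
    rewards = torch.randn(G, K)                         # e.g. latent-goal similarity scores
    adv = group_relative_advantages(rewards)
    log_probs = torch.randn(G, K, requires_grad=True)   # current policy log-probs of each chunk
    old_log_probs = log_probs.detach() + 0.01 * torch.randn(G, K)
    loss = clipped_policy_loss(log_probs, old_log_probs, adv)
    loss.backward()
    print(float(loss))
```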
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
instruction grounding
long-horizon tasks
error compounding
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)
predictive latent world model
subtask decomposition
offline post-training
long-horizon robotic manipulation
Xiaoquan Sun
INFIFORCE Intelligent Technology Co., Ltd., Hangzhou, China; Huazhong University of Science and Technology, Wuhan, China
Zetian Xu
INFIFORCE Intelligent Technology Co., Ltd., Hangzhou, China; The University of Hong Kong, Hong Kong SAR, China
Chen Cao
INFIFORCE Intelligent Technology Co., Ltd., Hangzhou, China; The University of Hong Kong, Hong Kong SAR, China
Zonghe Liu
INFIFORCE Intelligent Technology Co., Ltd., Hangzhou, China; The University of Hong Kong, Hong Kong SAR, China
Yihan Sun
Assistant Professor, University of California, Riverside; Parallel Algorithms
Jingrui Pang
Tsinghua University, Beijing, China
Ruijian Zhang
Huazhong University of Science and Technology, Wuhan, China
Zhen Yang
Hong Kong University of Science and Technology (Guangzhou); Artificial Intelligence
Kang Pang
Huazhong University of Science and Technology, Wuhan, China
Dingxin He
Huazhong University of Science and Technology, Wuhan, China
Mingqi Yuan
PhD candidate at HKPU; Machine Learning
Jiayu Chen
INFIFORCE Intelligent Technology Co., Ltd., Hangzhou, China; The University of Hong Kong, Hong Kong SAR, China