VLA Model Post-Training via Action-Chunked PPO and Self Behavior Cloning

๐Ÿ“… 2025-09-29
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address reward sparsity, temporal policy inconsistency, and training instability in reinforcement learning (RL) post-training of vision-language-action (VLA) models, this paper proposes a PPO-based framework with action chunking, integrated self-behavioral cloning (Self-BC) loss, and a dynamically updated demonstration buffer, alongside an online dual-objective loss weighting strategy. Key contributions are: (1) segmenting continuous action sequences into semantically coherent action chunks to improve temporal policy modeling; (2) constructing a dynamic demonstration buffer from self-collected high-quality trajectories to increase feedback density; and (3) jointly optimizing RL and BC objectives for stable, efficient training. Evaluated on MetaWorld, our method achieves a success rate of 0.93 and reduces average task completion steps to 42.17โ€”substantially outperforming supervised fine-tuning baselines. Results demonstrate the effectiveness and robustness of the proposed approach for RL post-training of VLA models.
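The summary's first contribution, action chunking, can be sketched in a short rollout loop: the policy emits a fixed-length chunk of low-level actions per decision, and return signals are credited per chunk rather than per step, which densifies feedback under sparse rewards. This is an illustrative sketch, not the authors' code; the environment interface, `rollout_chunked`, and the chunk shape are assumptions.

```python
import numpy as np

def rollout_chunked(env, policy, max_steps=200):
    """Roll out a policy that predicts a chunk of actions per decision.

    One policy call covers the whole chunk of environment steps, so the
    reward signal is accumulated and credited per chunk, not per step.
    (Illustrative; assumes a gym-style env returning (obs, reward, done, info).)
    """
    obs = env.reset()
    trajectory = []
    steps, done = 0, False
    while steps < max_steps and not done:
        chunk = policy(obs)              # shape: (chunk_len, action_dim)
        chunk_reward = 0.0
        for action in chunk:
            obs, reward, done, _ = env.step(action)
            chunk_reward += reward
            steps += 1
            if done or steps >= max_steps:
                break
        trajectory.append((chunk, chunk_reward))
    return trajectory
```

With chunked credit assignment, the PPO advantage estimates are computed over far fewer decision points, which is one way the method could improve temporal consistency.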

๐Ÿ“ Abstract
Reinforcement learning (RL) is a promising avenue for post-training vision-language-action (VLA) models, but practical deployment is hindered by sparse rewards and unstable training. This work mitigates these challenges with an action-chunked proximal policy optimization (PPO) framework combined with behavior cloning from self-collected demonstrations. Aggregating consecutive actions into chunks improves the temporal consistency of the policy and the density of informative feedback. In addition, an auxiliary behavior cloning loss is applied with a dynamically updated demonstration buffer that continually collects high-quality task trials during training. The relative weight between the action-chunked PPO objective and the self behavior cloning auxiliary loss is adapted online to stabilize the post-training process. Experiments on the MetaWorld benchmark show improved performance over supervised fine-tuning, achieving a high success rate (0.93) and a low average number of steps to success (42.17). These results demonstrate the viability of RL for VLA post-training and help lay the groundwork for downstream VLA applications.
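The dual-objective weighting described in the abstract amounts to a scalar combination of the PPO loss and the Self-BC loss with a weight adapted online. A minimal sketch, assuming the BC weight decays with a running success-rate estimate (this particular schedule is an illustrative assumption, not the paper's rule):

```python
def joint_loss(ppo_loss, bc_loss, success_rate, lam_max=1.0):
    """Combine the action-chunked PPO loss with the Self-BC auxiliary loss.

    Here the BC weight shrinks as the policy's own success rate rises, so
    demonstrations dominate early training and RL dominates later.
    (Illustrative adaptation rule; the paper adapts the weight online.)
    """
    lam = lam_max * (1.0 - success_rate)
    return ppo_loss + lam * bc_loss
```

For example, at a 50% success rate a BC loss of 1.0 contributes 0.5 to a PPO loss of 2.0, giving a joint loss of 2.5; once the success rate reaches 1.0, the BC term vanishes entirely.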
Problem

Research questions and friction points this paper is trying to address.

Improving reinforcement learning for vision-language-action model post-training
Addressing sparse rewards and unstable training in VLA deployment
Enhancing temporal policy consistency via action chunking techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action-chunked PPO for temporal consistency
Self behavior cloning with dynamic buffer
Online adaptation of loss weights
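The second innovation, a self behavior cloning buffer, can be sketched as a bounded store that admits only the agent's own successful (and sufficiently fast) trajectories and evicts old ones as the policy improves. The class name, admission criteria, and thresholds below are illustrative assumptions, not the authors' implementation:

```python
import random
from collections import deque

class SelfDemoBuffer:
    """Buffer of the agent's own successful trajectories for Self-BC.

    Only trajectories that complete the task within a step budget are
    admitted; old demos are evicted FIFO (via deque maxlen) so the buffer
    tracks the improving policy. Thresholds here are illustrative.
    """
    def __init__(self, capacity=256, max_steps=100):
        self.buffer = deque(maxlen=capacity)
        self.max_steps = max_steps

    def maybe_add(self, trajectory, success, n_steps):
        if success and n_steps <= self.max_steps:
            self.buffer.append(trajectory)

    def sample(self, batch_size):
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)
```

Sampled demonstrations would feed the auxiliary BC loss each update, which is how this design could densify feedback in sparse-reward tasks.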
๐Ÿ”Ž Similar Papers
No similar papers found.
Si-Cheng Wang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, and School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Tian-Yu Xiang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, and School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Xiao-Hu Zhou
Institute of Automation, Chinese Academy of Sciences
Medical robotics, Image analysis, Deep learning
Mei-Jiang Gui
Institute of Automation, Chinese Academy of Sciences
Surgical Robot, Tactile Perception
Xiao-Liang Xie
Chinese Academy of Sciences
Robotic surgery
Shi-Qi Liu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, and School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Shuang-Yi Wang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, and School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Ao-Qun Jin
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, and School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Zeng-Guang Hou
Professor and Deputy Director, SKLMCCS, Institute of Automation, Chinese Academy of Sciences
Computational Intelligence, Robotics, Medical Robots, Intelligent Systems