🤖 AI Summary
This work addresses the instability and policy collapse commonly encountered when fine-tuning generative policies with reinforcement learning, particularly due to multimodal action distributions and long-horizon action sequences. The authors propose POCO, a framework that formulates policy optimization as a posterior inference problem. By employing an expectation-maximization (EM) procedure with a clipped objective, POCO distills a reward-weighted implicit posterior into the policy, while an offline-to-online learning paradigm leverages a pretrained prior to guide exploration. Notably, the approach requires neither explicit likelihood estimation nor architectural modifications, enabling efficient, direct fine-tuning of large vision-language-action (VLA) models. Evaluated across seven simulation benchmarks and four contact-rich real-world tasks, POCO significantly outperforms existing methods, achieving a 96.7% success rate on real-world tasks and effectively mitigating policy collapse.
📝 Abstract
Expressive generative models have advanced robotic manipulation by capturing complex, multi-modal action distributions over temporally extended trajectories. However, fine-tuning these policies via RL remains challenging due to instability and sample inefficiency. We introduce Posterior Optimization with Clipped Objective (POCO), a principled RL framework that formulates policy improvement as a posterior inference problem tailored for temporal action chunks. Through an Expectation-Maximization procedure, POCO distills a reward-weighted implicit posterior into the policy without likelihood estimation. Furthermore, POCO adopts an offline-to-online paradigm that anchors online exploration to pre-trained priors, and its model-agnostic design scales to fine-tune large VLA models without architectural modifications. Evaluations across 7 simulation benchmarks and 4 contact-rich real-world tasks demonstrate that POCO prevents catastrophic policy collapse, outperforms SOTA baselines, and achieves a 96.7% success rate on real-world tasks. Videos are available at our project website https://cccedric.github.io/poco/.
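To make the reward-weighted posterior idea concrete, here is a minimal toy sketch of an EM-style policy update with clipped weights. This is an illustration of the general technique only, not POCO's actual algorithm: the Gaussian policy, the exponential reward weighting, the weight-clipping scheme, and every function name and hyperparameter below are assumptions for the sake of a runnable example.

```python
import numpy as np

def em_policy_update(policy_mean, actions, rewards, beta=1.0, clip=5.0, lr=0.5):
    """One illustrative EM-style update (hypothetical, not the paper's method).

    E-step: form implicit posterior weights w_i proportional to exp(R_i / beta)
    over sampled action chunks, then clip them to bound the update size
    (a crude stand-in for a clipped objective).
    M-step: move the policy mean toward the reward-weighted average action,
    i.e. distill the weighted samples back into the policy without any
    explicit likelihood computation.
    """
    w = np.exp((rewards - rewards.max()) / beta)  # stabilized exponential weights
    w = np.clip(w, None, clip)                    # clipped weights (illustrative)
    w = w / w.sum()
    target = (w[:, None] * actions).sum(axis=0)   # reward-weighted posterior mean
    return policy_mean + lr * (target - policy_mean)

# Toy usage: reward peaks at action [1, 1], so the policy mean drifts there.
rng = np.random.default_rng(0)
mean = np.zeros(2)
for _ in range(20):
    acts = mean + rng.normal(size=(64, 2))         # sample action chunks from policy
    rews = -np.sum((acts - 1.0) ** 2, axis=1)      # toy reward, maximized at [1, 1]
    mean = em_policy_update(mean, acts, rews)
```

In this toy setting the weighted average approximates the mean of the implicit posterior (prior times exponentiated reward), so repeated E/M steps pull the policy toward high-reward actions without ever evaluating policy likelihoods.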