OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three problems in fine-tuning generative control policies: low sample efficiency, reliance on expert demonstrations, and difficulty improving from a poor initial policy. The authors propose OGPO, an algorithm that combines off-policy reinforcement learning with a modified PPO objective. OGPO maintains off-policy critic networks to maximize data reuse and propagates policy gradients through the policy's full generative process, which enables sample-efficient full-parameter fine-tuning. Notably, it reaches near-full task success with no expert data in the online replay buffer, even when initialized from a poorly performing behavior cloning policy. The method adds several stabilizers (success-buffer regularization, conservative advantages, $\chi^2$ regularization, and Q-variance reduction) to mitigate critic over-exploitation. It reports state-of-the-art results on multi-task manipulation, high-precision insertion, and dexterous control, and outperforms alternative methods based on policy steering and learned residual corrections.

📝 Abstract

Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for finetuning GCPs that maintains off-policy critic networks to maximize data reuse and propagates policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can fine-tune poorly-initialized behavior cloning policies to near-full task success with no expert data in the online replay buffer, and does so with little task-specific hyperparameter tuning. Through extensive empirical investigations, we demonstrate that OGPO drastically outperforms alternative methods based on policy steering and learning residual corrections, and identify the key mechanisms behind its performance. We further introduce practical stabilizers, including success-buffer regularization, conservative advantages, $\chi^2$ regularization, and Q-variance reduction, to mitigate critic over-exploitation across state- and pixel-based settings. Beyond proposing OGPO, we conduct a systematic empirical study of GCP finetuning, identifying the stabilizing mechanisms and failure modes that govern successful off-policy full-policy improvement.
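To make the central idea concrete, the following is a minimal sketch of a standard PPO clipped surrogate applied across the denoising steps of a generative policy, with a critic estimate standing in for the terminal reward. It is not the paper's modified objective, whose exact form the abstract does not give. The function name `clipped_denoising_surrogate`, the per-step log-probability arrays, the `baseline` term, and the default `clip_eps` are all illustrative assumptions.

```python
import numpy as np


def clipped_denoising_surrogate(logp_new, logp_old, q_values, baseline, clip_eps=0.2):
    """Standard PPO clipped surrogate, averaged over denoising steps (illustrative).

    logp_new, logp_old: arrays of shape (batch, K), log-probabilities of each of
        the K denoising transitions under the current and the data-collecting policy.
    q_values: array of shape (batch,), critic estimate for the final action;
        assumed here to play the role of the terminal reward.
    baseline: array of shape (batch,), hypothetical baseline subtracted from the
        critic estimate to form an advantage.
    """
    # The terminal advantage is shared by every denoising step of a sample.
    advantage = (q_values - baseline)[:, None]
    # Importance ratio between the current and the old policy, per step.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Pessimistic (elementwise minimum) objective, to be maximized.
    return np.minimum(unclipped, clipped).mean()


# Usage: a batch of 2 samples, each with 3 denoising steps.
rng = np.random.default_rng(0)
logp_old = rng.normal(size=(2, 3))
logp_new = logp_old + 0.05 * rng.normal(size=(2, 3))
objective = clipped_denoising_surrogate(
    logp_new, logp_old, q_values=np.array([1.0, 0.5]), baseline=np.array([0.25, 0.25])
)
```

Because every denoising transition receives the same critic-derived advantage, the update signal reaches all steps of the generative process rather than only the final one. Under these assumptions, that is one way to read "propagating policy gradients through the full generative process".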

Problem

Research questions and friction points this paper is trying to address.

generative control policies

sample-efficient finetuning

off-policy optimization

policy improvement

robot learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Off-policy Learning

Generative Control Policies

Policy Finetuning