OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

📅 2026-05-04

🤖 AI Summary
This work addresses three problems in finetuning generative control policies: low sample efficiency, reliance on expert demonstrations, and difficulty learning from poor initial policies. To overcome these limitations, the authors propose OGPO, an algorithm that combines off-policy reinforcement learning with a modified PPO objective. OGPO maximizes data reuse through off-policy critic networks and propagates policy gradients through the entire generative process, enabling sample-efficient full-policy finetuning. Notably, it reaches near-full task success with no expert data in the online replay buffer, even when initialized from a poorly performing behavior cloning policy. The method incorporates several stabilizers (success-buffer regularization, conservative advantages, χ² regularization, and Q-variance reduction) and achieves state-of-the-art performance on multi-task manipulation, high-precision insertion, and dexterous control benchmarks, drastically outperforming alternatives on policy steering and learning residual corrections.
📝 Abstract
Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for finetuning GCPs that maintains off-policy critic networks to maximize data reuse and propagates policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can finetune poorly initialized behavior cloning policies to near-full task success with no expert data in the online replay buffer, and it does so with little task-specific hyperparameter tuning. Through extensive empirical investigations, we demonstrate that OGPO drastically outperforms alternative methods on policy steering and learning residual corrections, and we identify the key mechanisms behind its performance. We further introduce practical stabilizers, including success-buffer regularization, conservative advantages, $\chi^2$ regularization, and Q-variance reduction, to mitigate critic over-exploitation across state- and pixel-based settings. Beyond proposing OGPO, we conduct a systematic empirical study of GCP finetuning, identifying the stabilizing mechanisms and failure modes that govern successful off-policy full-policy improvement.
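To make the core idea in the abstract concrete, the following is a minimal NumPy sketch of a standard PPO clipped surrogate applied across the denoising steps of a generative policy, with a critic-derived advantage standing in for the terminal reward. It is not the paper's modified objective, whose details the abstract does not give. The function names, the array shapes, the choice to broadcast one advantage to every denoising step, and the use of an ensemble minimum as a conservative estimate are all assumptions made for illustration.

```python
import numpy as np


def critic_advantage(q_values, baseline):
    """Conservative advantage from an ensemble of critics (illustrative assumption).

    q_values: shape (n_critics, batch), Q-estimates for the final generated action.
    baseline: shape (batch,), e.g. a value estimate for the same states.
    Taking the ensemble minimum is one common conservative choice; the paper's
    "conservative advantages" may be defined differently.
    """
    return q_values.min(axis=0) - baseline


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate, averaged over batch and denoising steps.

    logp_new, logp_old: shape (batch, K), log-probabilities of each of the K
        denoising transitions under the current and the data-collecting policy.
    advantage: shape (batch,), applied to every denoising step so that the
        gradient reaches the whole generative process (an assumption here).
    """
    ratio = np.exp(logp_new - logp_old)   # per-step importance ratio
    adv = advantage[:, None]              # broadcast over the K steps
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return np.minimum(unclipped, clipped).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, K = 4, 5
    logp_old = rng.normal(size=(batch, K))
    logp_new = logp_old + 0.1 * rng.normal(size=(batch, K))
    q = rng.normal(size=(2, batch))       # two hypothetical critics
    adv = critic_advantage(q, baseline=np.zeros(batch))
    print(clipped_surrogate(logp_new, logp_old, adv))
```

In an actual training loop the surrogate would be maximized with an autodiff framework, and the critics would be trained off-policy from a replay buffer; this sketch only shows how the quantities fit together.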
Problem

Research questions and friction points this paper is trying to address.

generative control policies
sample-efficient finetuning
off-policy optimization
policy improvement
robot learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Off-policy Learning
Generative Control Policies
Policy Finetuning
Sample Efficiency
Critic Regularization