Drift-Based Policy Optimization: Native One-Step Policy Learning for Online Robot Control

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that multi-step generative policies, because they require multiple network calls at inference time, struggle to meet the real-time demands of high-frequency closed-loop control and online reinforcement learning. To overcome this limitation, the authors propose a two-stage framework: first, they introduce the Drift-Based Policy (DBP), a single-step generative policy that internalizes iterative optimization into the model parameters via drift targets; second, they develop DBPO, an online reinforcement-learning framework that equips the policy with a compatible stochastic interface to enable stable policy updates. The approach is the first to support multimodal action modeling and native single-step inference simultaneously while allowing efficient and stable online optimization. Experiments demonstrate consistent improvements over existing methods across offline imitation, online fine-tuning, and real-world robotic tasks, achieving up to 100× faster inference and enabling dual-arm robot control at 105.2 Hz.
📝 Abstract
Although multi-step generative policies achieve strong performance in robotic manipulation by modeling multimodal action distributions, they require multi-step iterative denoising at inference time. Each action therefore needs tens to hundreds of network function evaluations (NFEs), making them costly for high-frequency closed-loop control and online reinforcement learning (RL). To address this limitation, we propose a two-stage framework for native one-step generative policies that shifts refinement from inference to training. First, we introduce the Drift-Based Policy (DBP), which leverages fixed-point drifting objectives to internalize iterative refinement into the model parameters, yielding a one-step generative backbone by design while preserving multimodal action modeling capacity. Second, we develop Drift-Based Policy Optimization (DBPO), an online RL framework that equips the pretrained backbone with a compatible stochastic interface, enabling stable on-policy updates without sacrificing the one-step deployment property. Extensive experiments demonstrate the effectiveness of the proposed framework across offline imitation learning, online fine-tuning, and real-world control scenarios. DBP matches or exceeds the performance of multi-step diffusion policies while achieving up to $100\times$ faster inference. It also consistently outperforms existing one-step baselines on challenging manipulation benchmarks. Moreover, DBPO enables effective and stable policy improvement in online settings. Experiments on a real-world dual-arm robot demonstrate reliable high-frequency control at 105.2 Hz.
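The abstract's core efficiency argument is that a diffusion-style policy pays tens to hundreds of network function evaluations (NFEs) per action, while a native one-step policy pays exactly one. The sketch below is purely illustrative (the `network` function is a toy stand-in, not the authors' architecture) and only counts NFEs to make the 100× claim concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
nfe = {"count": 0}  # tally of network function evaluations

def network(x):
    # Stand-in for one policy-network forward pass; each call is one NFE.
    nfe["count"] += 1
    return x - 0.1 * x

def multistep_action(obs, steps=100):
    """Diffusion-style inference: iteratively refine noise over `steps` NFEs."""
    a = rng.standard_normal(7)  # start from noise (7-DoF action, hypothetical)
    for _ in range(steps):
        a = network(a)
    return a

def one_step_action(obs):
    """One-step inference in the spirit of DBP: refinement is internalized
    into the weights, so each action costs a single NFE."""
    return network(rng.standard_normal(7))

nfe["count"] = 0
multistep_action(None)
multi_nfe = nfe["count"]   # 100 NFEs per action

nfe["count"] = 0
one_step_action(None)
single_nfe = nfe["count"]  # 1 NFE per action

print(multi_nfe // single_nfe)  # prints 100: the per-action speedup factor
```

At a fixed per-call latency, cutting NFEs from 100 to 1 is what makes control rates like the reported 105.2 Hz feasible; the actual speedup depends on the real network's forward-pass cost.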
Problem

Research questions and friction points this paper is trying to address.

multi-step generative policies
inference cost
high-frequency control
online reinforcement learning
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Drift-Based Policy
One-Step Generative Policy
Online Reinforcement Learning
Fixed-Point Drifting
High-Frequency Robot Control