🤖 AI Summary
This work addresses the inefficiency of existing ODE-based generative policies for robotic manipulation, which typically require multiple inference steps. The authors propose a single-step, non-ODE policy framework that, for the first time, incorporates Wasserstein-2 gradient flows into policy learning. The policy update is modeled as a reverse-KL gradient flow toward a soft target policy, so each update corresponds to a gradient step in probability space. The resulting gradient combines value improvement with score matching against an anchor policy, and a computationally tractable surrogate loss reduces the update to something close to behavior cloning on critic-selected top-K actions. Evaluated on several manipulation tasks from Robomimic and OGBench, the approach achieves state-of-the-art performance with a single inference step, outperforming ODE-based policies.
📝 Abstract
We propose Drifting Field Policy (DFP), a non-ODE one-step generative policy built on the drifting model paradigm. We frame the policy update as a reverse-KL Wasserstein-2 gradient flow toward a soft target policy, so that each DFP update corresponds to a gradient step in probability space. By construction, this gradient decomposes into an ascent toward higher action-value regions and a score-matching term toward the anchor policy, which acts as a trust region. We further derive a simple, tractable surrogate of the otherwise intractable update loss, akin to behavior cloning on top-K critic-selected actions. We find empirically that this mechanism uniquely benefits the drifting backbone owing to its non-ODE parameterization. With one-step inference, DFP achieves state-of-the-art performance on several manipulation tasks across Robomimic and OGBench, outperforming ODE-based policies.
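The "behavior cloning on top-K critic-selected actions" idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `select_top_k` and `bc_surrogate_loss`, the toy critic, and the squared-error cloning loss are all assumptions made for illustration, and the abstract does not specify the actual loss used with the drifting backbone.

```python
from typing import Callable, List, Sequence, Tuple

Action = Tuple[float, ...]
State = Tuple[float, ...]


def select_top_k(
    state: State,
    candidates: Sequence[Action],
    critic: Callable[[State, Action], float],
    k: int,
) -> List[Action]:
    """Return the k candidate actions with the highest critic value.

    `candidates` stands in for actions sampled from an anchor policy
    (hypothetical; how candidates are drawn is not stated in the abstract).
    """
    if not 1 <= k <= len(candidates):
        raise ValueError("k must be between 1 and the number of candidates")
    ranked = sorted(candidates, key=lambda a: critic(state, a), reverse=True)
    return list(ranked[:k])


def bc_surrogate_loss(policy_action: Action, targets: Sequence[Action]) -> float:
    """Mean squared distance from the policy's one-step output to the targets.

    Squared error is an illustrative stand-in for a cloning loss, not the
    loss derived in the paper.
    """
    total = 0.0
    for target in targets:
        total += sum((p - t) ** 2 for p, t in zip(policy_action, target))
    return total / len(targets)


def toy_critic(state: State, action: Action) -> float:
    """Toy Q-function (assumption): prefers actions close to the state vector."""
    return -sum((a - s) ** 2 for a, s in zip(action, state))


state = (0.5, -0.5)
candidates = [(0.0, 0.0), (1.0, 1.0), (0.5, -0.5), (0.4, -0.4)]
top_actions = select_top_k(state, candidates, toy_critic, k=2)
loss = bc_surrogate_loss((0.5, -0.5), top_actions)
```

In this sketch the critic only ranks candidates, and the policy is then regressed toward the selected actions, so no gradient of the critic with respect to the action is needed.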