One-Step Flow Policy Mirror Descent

๐Ÿ“… 2025-07-31
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Diffusion-based policies achieve strong performance in online reinforcement learning but suffer from prohibitively slow inference due to iterative multi-step sampling. To address this, we propose Flow Policy Mirror Descent (FPMD), the first framework enabling single-step *explicit* policy inference. FPMD parameterizes both a flow policy and a MeanFlow policy grounded in flow matching, and integrates them into a mirror descent optimization framework. Crucially, we theoretically characterize the quantitative relationship between distributional variance and single-step discretization errorโ€”enabling high-fidelity one-step sampling *without* knowledge distillation or consistency training. Evaluated on the MuJoCo benchmark, FPMD matches state-of-the-art diffusion policies in control performance while reducing function evaluations by two to three orders of magnitude, thereby significantly enhancing real-time capability for online decision-making.

Technology Category

Application Category

๐Ÿ“ Abstract
Diffusion policies have achieved great success in online reinforcement learning (RL) due to their strong expressive capacity. However, the inference of diffusion policy models relies on a slow iterative sampling process, which limits their responsiveness. To overcome this limitation, we propose Flow Policy Mirror Descent (FPMD), an online RL algorithm that enables 1-step sampling during policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight interpolation flow matching models, and requires no extra distillation or consistency training. We present two algorithm variants based on flow policy and MeanFlow policy parametrizations, respectively. Extensive empirical evaluations on MuJoCo benchmarks demonstrate that our algorithms show strong performance comparable to diffusion policy baselines while requiring hundreds of times fewer function evaluations during inference.
Problem

Research questions and friction points this paper is trying to address.

Slow iterative sampling in diffusion policies limits responsiveness
Proposing 1-step sampling for faster policy inference in RL
Achieving performance comparable to diffusion baselines with fewer evaluations
Innovation

Methods, ideas, or system contributions that make the work stand out.

One-step sampling for faster policy inference
Flow Policy Mirror Descent algorithm introduced
No extra distillation or consistency training needed
๐Ÿ”Ž Similar Papers
T
Tianyi Chen
Georgia Institute of Technology
Haitong Ma
Haitong Ma
Graduate student, Harvard University
Reinforcement LearningRoboticsControl Theory
N
Na Li
Harvard University
K
Kai Wang
Georgia Institute of Technology
B
Bo Dai
Georgia Institute of Technology