🤖 AI Summary
Diffusion-model policies are slow to train and to sample from in reinforcement learning because generating each action requires an iterative denoising process. This work is the first to use MeanFlow, a few-step flow model, as a policy representation, and it trains that policy with soft policy iteration under the maximum-entropy reinforcement learning framework. The approach retains strong generative capability while significantly improving computational efficiency. It addresses two key challenges in applying MeanFlow to reinforcement learning: accurately evaluating action likelihoods and designing an optimization objective compatible with soft policy improvement. Empirical results show that the method matches or exceeds state-of-the-art diffusion-based policy baselines on MuJoCo and DeepMind Control Suite benchmarks while substantially reducing both training and inference time.
📝 Abstract
Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose representing policies with MeanFlow models, a class of few-step flow-based generative models, which improves training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo and DeepMind Control Suite benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding that of current diffusion-based baselines while considerably reducing training and inference time. Our code is available at https://github.com/MFPolicy/MFPO.
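To make "few-step" concrete, here is a minimal sketch of how an action could be drawn from a MeanFlow-style policy. It is not taken from the MFPO code. It assumes the usual MeanFlow convention in which a model predicts the average velocity u(z, r, t) between times r and t, with t = 1 corresponding to noise and t = 0 to the action, so that one jump is z_r = z_t - (t - r) * u(z_t, r, t). The names `toy_mean_velocity` and `sample_action` are hypothetical, and the toy velocity field is a hand-written stand-in for a learned, state-conditioned network.

```python
import random


def toy_mean_velocity(z, r, t, target):
    """Hypothetical stand-in for a learned network u_theta(z, r, t | state).

    For a target distribution concentrated on a single action `target`,
    the exact average velocity along the straight noise-to-action path
    is (z - target) / t, independent of r.
    """
    return (z - target) / t


def sample_action(mean_velocity, noise, num_steps=1):
    """Move from noise (t = 1) to an action (t = 0) in `num_steps` jumps.

    Each jump applies z_r = z_t - (t - r) * u(z_t, r, t).
    num_steps=1 is one-step generation.
    """
    times = [1.0 - k / num_steps for k in range(num_steps + 1)]
    z = noise
    for t, r in zip(times[:-1], times[1:]):
        z = z - (t - r) * mean_velocity(z, r, t)
    return z


random.seed(0)
target_action = 0.3  # assumed scalar action, for illustration only
noise = random.gauss(0.0, 1.0)


def policy_velocity(z, r, t):
    return toy_mean_velocity(z, r, t, target_action)


one_step = sample_action(policy_velocity, noise, num_steps=1)
two_step = sample_action(policy_velocity, noise, num_steps=2)
```

In this toy setting both the one-step and the two-step samplers land on `target_action`, because the average velocity is exact. With a learned network, the number of steps would trade sample quality against inference cost, which is the efficiency lever the abstract contrasts with iterative diffusion sampling.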