🤖 AI Summary
Gaussian policies in continuous control struggle to model multimodal action distributions, limiting performance in partially observable or reward-sparse environments. Method: We propose Flow Policy Optimization (FPO), the first on-policy policy gradient framework to integrate conditional flow matching. FPO introduces an advantage-weighted flow matching loss compatible with PPO-clip, eliminating the need for explicit likelihood computation and decoupling the training objective from the sampling procedure. This enables end-to-end training of diffusion-style policies from scratch. Contribution/Results: Evaluated on standard continuous control benchmarks, FPO consistently outperforms Gaussian policies, particularly under challenging conditions such as partial observability and sparse rewards. It captures multimodal action distributions that Gaussian policies cannot, offering an expressive yet tractable policy representation for on-policy reinforcement learning.
📝 Abstract
Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
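The core mechanism described above, an advantage-weighted ratio built from the conditional flow matching (CFM) loss and plugged into PPO-clip, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`cfm_loss`, `fpo_clip_objective`) are hypothetical, the CFM target assumes the common straight-line (rectified flow) interpolant, and the ratio `exp(loss_old - loss_new)` stands in for the usual PPO likelihood ratio.

```python
import numpy as np

def cfm_loss(pred_velocity, x0, x1):
    # Conditional flow matching loss for a straight-line interpolant:
    # the regression target is the constant velocity x1 - x0.
    # Returns a per-sample loss (mean squared error over action dims).
    target = x1 - x0
    return np.mean((pred_velocity - target) ** 2, axis=-1)

def fpo_clip_objective(loss_new, loss_old, advantages, eps=0.2):
    # Surrogate ratio derived from CFM losses: exp(old - new) plays the
    # role of pi_new(a|s) / pi_old(a|s) in PPO, so no explicit
    # likelihood of the flow policy is ever computed.
    ratio = np.exp(loss_old - loss_new)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Standard PPO-clip pessimism: take the elementwise minimum of the
    # unclipped and clipped advantage-weighted terms, then average.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

With equal old and new losses the ratio is 1 and the objective reduces to the mean advantage; when the new loss drops sharply on a positive-advantage sample, the clip caps the update at `1 + eps`, mirroring PPO's trust-region behavior.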