🤖 AI Summary
This work proposes an online reinforcement learning framework based on ordinary differential equations (ODEs) to address two challenges of diffusion-based approaches: intractable entropy control and the high computational cost of policy gradient estimation. The method parameterizes policies via flow matching, samples actions along straight-line probability paths inspired by optimal transport, and guides policy updates using an advantage-weighted velocity field. Its key innovation is the first principled integration of an analytically tractable entropy regularization term directly into the flow-matching policy, enabling maximum-entropy optimization while substantially reducing training overhead. Experiments demonstrate that the approach outperforms state-of-the-art methods on the sparse-reward, multi-goal FrankaKitchen tasks, remains competitive on standard MuJoCo benchmarks, and achieves 7× faster training than QVPO and a 10–15% speedup over efficient diffusion variants.
📝 Abstract
Diffusion-based policies have gained significant popularity in Reinforcement Learning (RL) due to their ability to represent complex, non-Gaussian distributions. Stochastic Differential Equation (SDE)-based diffusion policies often rely on indirect entropy control because the exact entropy is intractable, while also suffering from computationally prohibitive policy gradients through the iterative denoising chain. To overcome these issues, we propose Flow Matching Policy with Entropy Regularization (FMER), an Ordinary Differential Equation (ODE)-based online RL framework. FMER parameterizes the policy via flow matching and samples actions along a straight probability path, motivated by optimal transport. FMER leverages the model's generative nature to construct an advantage-weighted target velocity field from a candidate set, steering policy updates toward high-value regions. By deriving a tractable entropy objective, FMER enables principled maximum-entropy optimization for enhanced exploration. Experiments on sparse multi-goal FrankaKitchen benchmarks demonstrate that FMER outperforms state-of-the-art methods, while remaining competitive on standard MuJoCo benchmarks. Moreover, FMER reduces training time by 7× compared to heavy diffusion baselines (QVPO) and by 10–15% relative to efficient variants.
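To make the two core ideas concrete, here is a minimal NumPy sketch of (a) sampling an action by Euler-integrating an ODE velocity field along the straight probability path from noise at t=0 to an action at t=1, and (b) building an advantage-weighted target velocity from a candidate set. The function names, the placeholder `velocity_field`, and the exponential advantage weighting with temperature `beta` are illustrative assumptions for exposition, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_field(a_t, t, s):
    # Stand-in for the learned velocity network v_theta(a_t, t | s).
    # A real policy would evaluate a neural network here; this placeholder
    # just nudges the sample toward a state-dependent point.
    return -a_t + np.tanh(s)

def sample_action(s, steps=10):
    """Sample an action by Euler integration of da/dt = v(a, t | s)
    from t=0 (Gaussian noise) to t=1, following the straight-path
    flow-matching sampler described in the abstract."""
    a = rng.standard_normal(s.shape)
    dt = 1.0 / steps
    for k in range(steps):
        a = a + dt * velocity_field(a, k * dt, s)
    return a

def advantage_weighted_target(noise, candidates, advantages, beta=1.0):
    """For a straight path a_t = (1 - t) * a0 + t * a1, the per-candidate
    flow-matching target velocity is (a1 - a0). Weighting candidates by
    exp(A / beta) (an assumed weighting scheme) biases the target toward
    high-advantage actions."""
    w = np.exp(beta * (advantages - advantages.max()))  # stabilized softmax
    w = w / w.sum()
    return (w[:, None] * (candidates - noise)).sum(axis=0)
```

For example, with two candidate actions where the second has a much larger advantage, the weighted target velocity points almost entirely toward the second candidate, which is how policy updates get steered toward high-value regions.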