Flow-Based Policy for Online Reinforcement Learning

📅 2025-06-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
In online reinforcement learning (RL), policy representations based on normalizing flows suffer from a fundamental objective mismatch with value-driven optimization. This paper introduces FlowRL, the first framework to incorporate differentiable flow models into online RL: it defines an ordinary differential equation (ODE)-based policy via a state-dependent velocity field and jointly optimizes the Q-function maximization objective with Wasserstein-2 distance regularization, enabling value-aware dynamic action generation. FlowRL bridges the intrinsic gap between the density estimation objective of flow modeling and the policy optimization objective in RL. Evaluated on DMControl and HumanoidBench benchmarks, FlowRL achieves state-of-the-art performance among online RL algorithms, demonstrating significantly improved modeling capacity for multimodal and non-Gaussian action distributions, as well as enhanced policy stability.
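The summary describes actions being generated by deterministic ODE integration of a state-dependent velocity field starting from noise. Below is a minimal sketch of that sampling loop, assuming a PyTorch MLP velocity field, fixed-step Euler integration, and tanh squashing to bounded actions; the architecture, step count, and squashing are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """State-conditioned velocity field v(a, s, t) (illustrative MLP)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.action_dim = action_dim
        # Inputs: current point a, conditioning state s, scalar flow time t.
        self.net = nn.Sequential(
            nn.Linear(action_dim + state_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, a, s, t):
        return self.net(torch.cat([a, s, t], dim=-1))

def sample_action(vf: VelocityField, state: torch.Tensor, steps: int = 10):
    """Deterministic Euler integration of da/dt = v(a, s, t) from t=0 to t=1."""
    a = torch.randn(state.shape[0], vf.action_dim, device=state.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((state.shape[0], 1), i * dt, device=state.device)
        a = a + dt * vf(a, state, t)
    return torch.tanh(a)  # squash into action bounds (an assumption here)
```

Because the integration is deterministic given the initial noise, gradients from a critic can flow back through every Euler step into the velocity field, which is what makes value-aware training of the flow policy possible.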

📝 Abstract
We present FlowRL, a novel framework for online reinforcement learning that integrates flow-based policy representation with Wasserstein-2-regularized optimization. We argue that, in addition to training signals, enhancing the expressiveness of the policy class is crucial for performance gains in RL. Flow-based generative models offer such potential, excelling at capturing complex, multimodal action distributions. However, their direct application in online RL is challenging due to a fundamental objective mismatch: standard flow training optimizes for static data imitation, while RL requires value-based policy optimization through a dynamic buffer, leading to difficult optimization landscapes. FlowRL first models policies via a state-dependent velocity field, generating actions through deterministic ODE integration from noise. We derive a constrained policy search objective that jointly maximizes Q through the flow policy while bounding the Wasserstein-2 distance to a behavior-optimal policy implicitly derived from the replay buffer. This formulation effectively aligns flow optimization with the RL objective, enabling efficient and value-aware policy learning despite the complexity of the policy class. Empirical evaluations on DMControl and HumanoidBench demonstrate that FlowRL achieves competitive performance on online reinforcement learning benchmarks.
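The abstract's constrained policy search, maximizing Q while bounding the Wasserstein-2 distance to a behavior-derived policy, can be sketched as a single scalar loss. The sketch below is a conceptual reading, not the paper's exact derivation: `q_net` and `alpha` are assumed names, `ref_actions` stands in for actions attributed to the behavior-optimal policy from the replay buffer, and the per-sample squared error is one simple coupling-based surrogate for the Wasserstein-2 term.

```python
import torch

def flow_policy_loss(q_net, vf, states, ref_actions,
                     alpha: float = 0.1, steps: int = 10):
    """Maximize Q under the flow policy while staying W2-close to reference actions."""
    # Re-run the deterministic ODE with gradients enabled so the Q signal
    # reaches the velocity field's parameters.
    a = torch.randn_like(ref_actions)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((states.shape[0], 1), i * dt, device=states.device)
        a = a + dt * vf(a, states, t)
    a = torch.tanh(a)

    q_term = q_net(states, a).mean()                   # value maximization
    w2_term = ((a - ref_actions) ** 2).sum(-1).mean()  # W2 surrogate (assumption)
    return -q_term + alpha * w2_term
```

Minimizing this loss trades off the two objectives: the Q term pulls generated actions toward high-value regions, while the regularizer keeps the flow from drifting into the poorly-calibrated regions a pure density-estimation objective would never visit, which is how the formulation bridges the objective mismatch described above.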
Problem

Research questions and friction points this paper is trying to address.

Enhancing policy expressiveness in online reinforcement learning
Aligning flow-based optimization with dynamic RL objectives
Improving performance on tasks with complex, multimodal action distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow-based policy representation for RL
Wasserstein-2-regularized optimization approach
State-dependent velocity field action generation
Authors

Lei Lv
ByteDance Seed; Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University

Yunfei Li
ByteDance Seed
Reinforcement Learning, Robotics

Yu Luo
Tsinghua University

Fuchun Sun
Tsinghua University

Tao Kong
ByteDance Research
Robot Foundation Model, Robot Learning, Computer Vision

Jiafeng Xu
ByteDance Seed

Xiao Ma
ByteDance Seed