🤖 AI Summary
Existing reparameterized policy gradient methods are largely confined to Gaussian policies, struggling to balance exploration with training stability and unable to leverage advanced generative models such as normalizing flows. This work establishes, for the first time, an intrinsic connection between flow-based policies and reparameterized policy gradients, proposing a differentiable ODE-based flow policy optimization framework. By jointly backpropagating through both the flow generation process and the system dynamics, the method achieves highly sample-efficient learning. Key innovations include a likelihood-free gradient estimator, an action-chunking mechanism, and tailored regularization terms that promote stability and exploration. Experiments demonstrate significant performance gains over baseline methods across a range of rigid-body and soft-body control tasks, with nearly a two-fold improvement in reward on a soft quadruped locomotion benchmark.
📝 Abstract
Reparameterization Policy Gradient (RPG) has emerged as a powerful paradigm for model-based reinforcement learning, enabling high sample efficiency by backpropagating gradients through differentiable dynamics. However, prior RPG approaches have been predominantly restricted to Gaussian policies, limiting their performance and failing to leverage recent advances in generative models. In this work, we identify that flow policies, which generate actions via differentiable ODE integration, naturally align with the RPG framework, a connection not established in prior work. Yet naively exploiting this synergy proves ineffective, often suffering from training instability and a lack of exploration. To address these issues, we propose Reparameterization Flow Policy Optimization (RFO). RFO computes policy gradients by backpropagating jointly through the flow generation process and the system dynamics, unlocking high sample efficiency without requiring intractable log-likelihood computations. RFO includes two tailored regularization terms for stability and exploration, and we further propose a variant of RFO with action chunking. Extensive experiments on diverse locomotion and manipulation tasks, involving both rigid and soft bodies with state or visual inputs, demonstrate the effectiveness of RFO. Notably, on a challenging locomotion task controlling a soft-body quadruped, RFO achieves almost $2\times$ the reward of the state-of-the-art baseline.
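The core mechanism described above, generating an action by ODE integration of a learned velocity field and differentiating the return jointly through the flow and the dynamics, can be illustrated with a toy NumPy sketch. Everything below is a hedged illustration, not the paper's implementation: the tanh velocity field, the point-mass dynamics, and the use of central finite differences in place of autodiff are all assumptions made for a self-contained example.

```python
import numpy as np

def flow_action(theta, z, n_steps=8):
    """Generate an action by Euler-integrating a (toy) velocity field
    v(a, t) = tanh(W @ [a, t] + b) from noise z at t=0 to t=1.
    This stands in for the paper's differentiable ODE flow policy."""
    W, b = theta
    a = z.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = np.tanh(W @ np.append(a, t) + b)  # velocity at (a, t)
        a = a + dt * v                        # Euler step of the flow ODE
    return a

def rollout_reward(theta, z, s0, horizon=5):
    """Roll out simple differentiable point-mass dynamics s' = s + 0.1*a
    with the flow-generated action; reward = -||s||^2 (drive state to 0).
    Dynamics and reward are illustrative assumptions."""
    s = s0.copy()
    total = 0.0
    for _ in range(horizon):
        a = flow_action(theta, z)
        s = s + 0.1 * a
        total += -float(np.dot(s, s))
    return total

def policy_gradient(theta, z, s0, eps=1e-5):
    """First-order (reparameterized) gradient of the return w.r.t. the
    flow parameters. Here it is approximated by central finite differences;
    an actual implementation would backpropagate jointly through the ODE
    integration and the dynamics with automatic differentiation."""
    grads = []
    for p in theta:
        g = np.zeros_like(p)
        it = np.nditer(p, flags=["multi_index"])
        for _ in it:
            idx = it.multi_index
            old = p[idx]
            p[idx] = old + eps
            r_plus = rollout_reward(theta, z, s0)
            p[idx] = old - eps
            r_minus = rollout_reward(theta, z, s0)
            p[idx] = old
            g[idx] = (r_plus - r_minus) / (2 * eps)
        grads.append(g)
    return grads
```

Because the reward is a smooth function of the parameters through both the flow integration and the dynamics, a few gradient-ascent steps on this estimate improve the return without ever evaluating the (intractable) log-likelihood of the flow policy, which is the key property the abstract highlights.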