Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the instability of training expressive flow-based policies in reinforcement learning. The instability arises from backpropagating gradients through numerical solvers, and existing remedies restrict the policy's expressive capacity, trading expressiveness for stability. The authors propose Q-Flow, a framework that leverages the deterministic nature of flow dynamics to explicitly propagate terminal trajectory values back to intermediate latent states. This enables stable optimization from intermediate value gradients without unrolling the numerical solver, preserving representational capacity. The same framework supports both offline training and online adaptation. On the OGBench offline benchmark, Q-Flow outperforms state-of-the-art baselines by an average of 10.6 percentage points.

📝 Abstract

There is growing interest in utilizing flow-based models as decision-making policies in reinforcement learning due to their high expressive capacity. However, effectively leveraging this expressivity for value maximization remains challenging, as naive gradient-based optimization requires backpropagating through numerical solvers and often leads to instability. Existing approaches typically address this issue by restricting the expressive capacity of flow-based policies, resulting in a trade-off between optimization stability and representational flexibility. To resolve this, we introduce Q-Flow, a framework that leverages the deterministic nature of flow dynamics to explicitly propagate terminal trajectory value to intermediate latent states along the policy-induced flow. This formulation enables stable policy optimization using intermediate value gradients without unrolling the numerical solver, effectively bridging the gap between stability and expressivity. We evaluate Q-Flow in the offline learning setting on the challenging OGBench suite, where it consistently outperforms state-of-the-art baselines by an average of 10.6 percentage points, while also enabling stable online adaptation within the same framework.
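The abstract does not give the update rule, so the following is only a minimal sketch of the general idea of propagating a terminal value to intermediate flow states. It is not the authors' algorithm. The velocity field `velocity`, the critic `terminal_q`, the Euler solver, and the step count are all hypothetical stand-ins. Because the flow is deterministic, every intermediate state on a trajectory reaches exactly one terminal action, so each state can be labelled with that action's value.

```python
import numpy as np

def velocity(x, t):
    # Hypothetical velocity field standing in for a learned flow policy.
    return 1.0 - x

def terminal_q(a):
    # Hypothetical critic: prefers terminal actions near 0.5.
    return -(a - 0.5) ** 2

def rollout(x0, n_steps=10):
    """Euler-integrate dx/dt = velocity(x, t) from t=0 to t=1, keeping every state."""
    dt = 1.0 / n_steps
    xs = [np.asarray(x0, dtype=float)]
    for k in range(n_steps):
        xs.append(xs[-1] + dt * velocity(xs[-1], k * dt))
    return np.stack(xs)  # shape (n_steps + 1, batch)

def intermediate_value_targets(x0, n_steps=10):
    """Label each intermediate state with the value of the action it flows to."""
    xs = rollout(x0, n_steps)
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    q_terminal = terminal_q(xs[-1])                   # shape (batch,)
    targets = np.broadcast_to(q_terminal, xs.shape)   # constant along each trajectory
    return xs, ts, targets

rng = np.random.default_rng(0)
xs, ts, targets = intermediate_value_targets(rng.normal(size=4), n_steps=10)
```

Under these assumptions, a regressor fitted from `(xs, ts)` to `targets` would act as an intermediate value function, and its gradient with respect to an intermediate state could be used for policy improvement without differentiating through `rollout`.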

Problem

Research questions and friction points this paper is trying to address.

flow-based policy

reinforcement learning

optimization stability

expressivity

value maximization

Innovation

Methods, ideas, or system contributions that make the work stand out.

flow-based policy

reinforcement learning

value propagation