🤖 AI Summary
Flow-based policies in off-policy reinforcement learning suffer from ill-conditioned gradients and training instability due to multi-step action sampling in rollout procedures. Method: This paper establishes, for the first time, the algebraic equivalence between flow-based rollouts and residual recurrent computation, and proposes a velocity reparameterization framework for sequence modeling. Based on this, we design two stable architectures, the Gated Velocity Network (Flow-G) and the Decoded Velocity Network (Flow-T), enabling end-to-end differentiable training without policy distillation or auxiliary objectives. We further integrate noise-augmented rollouts into the Soft Actor-Critic (SAC) framework. Contribution/Results: Experiments demonstrate state-of-the-art performance on continuous control and robotic manipulation benchmarks, significantly improving sample efficiency. The approach supports both from-scratch training and offline-to-online transfer, establishing a robust and scalable foundation for flow-based policy optimization.
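The rollout-recurrence equivalence can be sketched in a few lines. This is a minimal illustration with hypothetical names (`velocity`, `flow_rollout`), not the paper's implementation: an Euler rollout of a flow policy, a_{k+1} = a_k + (1/K) · v(a_k, t_k), has the same residual form as an RNN unrolled over K steps, which is why its gradients can vanish or explode as K grows.

```python
import numpy as np

def velocity(a, t, W):
    # Hypothetical one-layer velocity network v(a, t); the paper's
    # architectures (Flow-G, Flow-T) reparameterize this component.
    return np.tanh(W @ a) * (1.0 - t)

def flow_rollout(a0, W, K=8):
    # Euler integration of the flow: each step is a residual update,
    # algebraically the same shape as a residual RNN cell applied K times.
    a = a0
    for k in range(K):
        a = a + (1.0 / K) * velocity(a, k / K, W)
    return a

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2)) * 0.5
a_final = flow_rollout(rng.normal(size=2), W)
```

Backpropagating through `flow_rollout` multiplies K per-step Jacobians, which is the gradient pathology the paper traces.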
📝 Abstract
Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. To address this, we reparameterize the velocity network using principles from modern sequential models, introducing two stable architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which utilizes a decoded velocity. We then develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies. Our approach supports both from-scratch and offline-to-online learning and achieves state-of-the-art performance on continuous control and robotic manipulation benchmarks, eliminating the need for common workarounds like policy distillation or surrogate objectives.
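As a rough illustration of the gated-velocity idea, the sketch below shows a GRU-style gate applied to each rollout step. All names (`gated_step`, `flow_g_rollout`, the single-layer gate and candidate networks) are our assumptions for exposition, not the paper's Flow-G parameterization: the point is only that a gate in (0, 1) interpolating between keeping the current action and taking a residual step bounds the per-step Jacobian, as in gated RNNs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_step(a, t, Wg, Wv, K):
    # Gate g in (0, 1) controls how much of the residual velocity step
    # is applied, keeping the step's Jacobian close to the identity.
    g = sigmoid(Wg @ a + t)          # update gate (hypothetical form)
    v = np.tanh(Wv @ a)              # candidate velocity (hypothetical form)
    return (1.0 - g) * a + g * (a + v / K)

def flow_g_rollout(a0, Wg, Wv, K=8):
    a = a0
    for k in range(K):
        a = gated_step(a, k / K, Wg, Wv, K)
    return a

rng = np.random.default_rng(1)
Wg = rng.normal(size=(2, 2)) * 0.5
Wv = rng.normal(size=(2, 2)) * 0.5
a_final = flow_g_rollout(rng.normal(size=2), Wg, Wv)
```

In an actual policy the gate and candidate would be learned networks conditioned on the observation; end-to-end SAC training then backpropagates the critic's gradient through this stabilized rollout.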