SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Flow-based policies in off-policy reinforcement learning suffer from ill-conditioned gradients and training instability due to multi-step action sampling in rollout procedures. Method: This paper establishes, for the first time, the algebraic equivalence between flow-based rollouts and residual recurrent computation, and proposes a velocity reparameterization framework for sequence modeling. Based on this, we design two stable architectures—Gated Velocity Network (Flow-G) and Decoded Velocity Network (Flow-T)—enabling end-to-end differentiable training without policy distillation or auxiliary objectives. We further integrate noise-augmented rollouts into the Soft Actor-Critic (SAC) framework. Contribution/Results: Experiments demonstrate state-of-the-art performance on continuous control and robotic manipulation benchmarks, significantly improving sample efficiency. The approach supports both from-scratch training and offline-to-online learning, establishing a robust and scalable foundation for flow-based policy optimization.

📝 Abstract
Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. To address this, we reparameterize the velocity network using principles from modern sequential models, introducing two stable architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which utilizes a decoded velocity. We then develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies. Our approach supports both from-scratch and offline-to-online learning and achieves state-of-the-art performance on continuous control and robotic manipulation benchmarks, eliminating the need for common workarounds like policy distillation or surrogate objectives.
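The rollout-to-recurrence equivalence the abstract describes can be made concrete with a small sketch: a K-step Euler integration of a learned velocity field, where each step has the residual form a ← a + (1/K)·v_θ(a, s, k) of a residual RNN cell. Names (`FlowPolicy`, `velocity_net`) and the step count are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class FlowPolicy(nn.Module):
    """Flow-based policy: integrates a learned velocity field from noise to action."""
    def __init__(self, state_dim, action_dim, hidden=64, n_steps=8):
        super().__init__()
        self.n_steps = n_steps
        # velocity network v_theta(a_k, s, t): takes current action, state, and time
        self.velocity_net = nn.Sequential(
            nn.Linear(action_dim + state_dim + 1, hidden),
            nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, noise):
        # Euler rollout: a_{k+1} = a_k + (1/K) * v_theta(a_k, s, t_k).
        # Algebraically this is a residual recurrence, which is why it inherits
        # the vanishing/exploding-gradient behaviour of RNNs when trained
        # end to end through all K steps.
        a = noise
        dt = 1.0 / self.n_steps
        for k in range(self.n_steps):
            t = torch.full_like(a[..., :1], k * dt)
            a = a + dt * self.velocity_net(torch.cat([a, state, t], dim=-1))
        return a

policy = FlowPolicy(state_dim=3, action_dim=2)
action = policy(torch.randn(5, 3), torch.randn(5, 2))  # (5, 2), differentiable
```

Backpropagating an actor loss through this loop multiplies K per-step Jacobians, which is exactly the pathology the paper's velocity reparameterization targets.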
Problem

Research questions and friction points this paper is trying to address.

Addresses unstable training of flow-based policies in reinforcement learning
Solves gradient vanishing and exploding issues in multi-step action sampling
Enables direct end-to-end training without policy distillation workarounds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Velocity-reparameterized flow architectures for stability
SAC-based algorithm with noise-augmented rollout training
Direct end-to-end training without policy distillation
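To illustrate how a gated velocity can stabilize the rollout, the sketch below uses a GRU-style convex-combination update. The paper's exact Flow-G parameterization is not reproduced here; this is an assumed, illustrative stand-in showing the general mechanism.

```python
import torch
import torch.nn as nn

class GatedVelocityStep(nn.Module):
    """One gated integration step -- an illustrative stand-in for the
    gated-velocity idea behind Flow-G (exact architecture not shown here)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        in_dim = action_dim + state_dim
        self.gate = nn.Sequential(nn.Linear(in_dim, action_dim), nn.Sigmoid())
        self.cand = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, a, state):
        x = torch.cat([a, state], dim=-1)
        g = self.gate(x)  # g in (0, 1) bounds the size of each update
        # Convex combination of the old action and a bounded candidate,
        # keeping the per-step Jacobian well-conditioned.
        return (1 - g) * a + g * self.cand(x)
```

Because each step interpolates between the previous action and a bounded candidate, the product of step Jacobians stays well behaved over many steps, mirroring how gating tames gradient flow in GRUs.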
Yixian Zhang
Tsinghua University
Shu'ang Yu
Tsinghua University
Tonghe Zhang
Carnegie Mellon University
Mo Guang
Li Auto
Haojia Hui
Li Auto
Kaiwen Long
Li Auto
Yu Wang
Tsinghua University
Chao Yu
Zhongguancun Academy
Wenbo Ding
University at Buffalo