🤖 AI Summary
Flow-based policies in off-policy reinforcement learning suffer from ill-conditioned gradients and training instability due to multi-step action sampling in rollout procedures. Method: This paper establishes, for the first time, the algebraic equivalence between flow-based rollouts and residual recurrent computation, and proposes a velocity reparameterization framework for sequence modeling. Based on this, we design two stable architectures, the Gated Velocity Network (Flow-G) and the Decoded Velocity Network (Flow-T), enabling end-to-end differentiable training without policy distillation or auxiliary objectives. We further integrate noise-augmented rollouts into the Soft Actor-Critic (SAC) framework. Contribution/Results: Experiments demonstrate state-of-the-art performance on continuous control and robotic manipulation benchmarks, significantly improving sample efficiency. The approach supports both from-scratch training and offline-to-online transfer, establishing a robust and scalable foundation for flow-based policy optimization.
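The rollout-recurrence equivalence can be sketched in a few lines. This is a minimal illustration with hypothetical names (`velocity`, `flow_rollout`), not the paper's implementation: an Euler rollout of a flow policy, a_{k+1} = a_k + (1/K) · v(a_k, t_k), has the same residual form as an RNN unrolled over K steps, which is why its gradients can vanish or explode as K grows.

```python
import numpy as np

def velocity(a, t, W):
    # Hypothetical one-layer velocity network v(a, t); the paper's
    # architectures (Flow-G, Flow-T) reparameterize this component.
    return np.tanh(W @ a) * (1.0 - t)

def flow_rollout(a0, W, K=8):
    # Euler integration of the flow: each step is a residual update,
    # algebraically the same shape as a residual RNN cell applied K times.
    a = a0
    for k in range(K):
        a = a + (1.0 / K) * velocity(a, k / K, W)
    return a

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2)) * 0.5
a_final = flow_rollout(rng.normal(size=2), W)
```

Backpropagating through `flow_rollout` multiplies K per-step Jacobians, which is the gradient pathology the paper traces.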
📝 Abstract
Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. To address this, we reparameterize the velocity network using principles from modern sequential models, introducing two stable architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which utilizes a decoded velocity. We then develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies. Our approach supports both from-scratch and offline-to-online learning and achieves state-of-the-art performance on continuous control and robotic manipulation benchmarks, eliminating the need for common workarounds like policy distillation or surrogate objectives.
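As a rough illustration of the gated-velocity idea, the sketch below shows a GRU-style gate applied to each rollout step. All names (`gated_step`, `flow_g_rollout`, the single-layer gate and candidate networks) are our assumptions for exposition, not the paper's Flow-G parameterization: the point is only that a gate in (0, 1) interpolating between keeping the current action and taking a residual step bounds the per-step Jacobian, as in gated RNNs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_step(a, t, Wg, Wv, K):
    # Gate g in (0, 1) controls how much of the residual velocity step
    # is applied, keeping the step's Jacobian close to the identity.
    g = sigmoid(Wg @ a + t)          # update gate (hypothetical form)
    v = np.tanh(Wv @ a)              # candidate velocity (hypothetical form)
    return (1.0 - g) * a + g * (a + v / K)

def flow_g_rollout(a0, Wg, Wv, K=8):
    a = a0
    for k in range(K):
        a = gated_step(a, k / K, Wg, Wv, K)
    return a

rng = np.random.default_rng(1)
Wg = rng.normal(size=(2, 2)) * 0.5
Wv = rng.normal(size=(2, 2)) * 0.5
a_final = flow_g_rollout(rng.normal(size=2), Wg, Wv)
```

In an actual policy the gate and candidate would be learned networks conditioned on the observation; end-to-end SAC training then backpropagates the critic's gradient through this stabilized rollout.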