🤖 AI Summary
To address the challenges in offline reinforcement learning—namely, difficulty modeling complex action distributions, training instability and high inference latency induced by iterative generation in flow-matching policies—this paper proposes Flow Q-Learning (FQL). FQL pioneers a single-step, value-driven RL training paradigm for flow-matching policies, eliminating conventional iterative sampling and recursive gradient backpropagation. It initializes a one-step flow-matching policy via behavior cloning and jointly optimizes it with Q-learning, supporting both state- and pixel-based inputs. Evaluated across 73 offline and offline-to-online tasks from OGBench and D4RL, FQL consistently outperforms existing flow-based and non-flow-based methods, achieving state-of-the-art performance. It delivers an optimal trade-off among expressive action modeling, training stability, and inference efficiency.
📝 Abstract
We present flow Q-learning (FQL), a simple and performant offline reinforcement learning (RL) method that leverages an expressive flow-matching policy to model arbitrarily complex action distributions in data. Training a flow policy with RL is a tricky problem, due to the iterative nature of the action generation process. We address this challenge by training an expressive one-step policy with RL, rather than directly guiding an iterative flow policy to maximize values. This way, we can completely avoid unstable recursive backpropagation, eliminate costly iterative action generation at test time, yet still mostly maintain expressivity. We experimentally show that FQL leads to strong performance across 73 challenging state- and pixel-based OGBench and D4RL tasks in offline RL and offline-to-online RL. Project page: https://seohong.me/projects/fql/