🤖 AI Summary
Diffusion Q-Learning (DQL) suffers from low training and inference efficiency and poor stability because it relies on multi-step denoising to generate actions. To address this, we propose One-Step Flow Q-Learning (OFQL), the first method to integrate Flow Matching into the diffusion Q-learning framework. OFQL models the average velocity field over the state-action space, enabling direct, single-step action generation and eliminating the need for iterative sampling, auxiliary models, or staged training. Crucially, the velocity field is optimized end-to-end under the Q-learning objective, jointly improving policy quality and generation efficiency. On the D4RL benchmark, OFQL significantly outperforms DQL and other diffusion-based baselines: it achieves several-fold speedups in both training and inference, superior final performance, and markedly improved convergence stability. Our core contribution is a principled, concise, fully differentiable, single-step diffusion-based paradigm for policy learning.
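The "optimized end-to-end under the Q-learning objective" claim can be made concrete with a toy loss computation. The sketch below is illustrative only and is not the paper's implementation: the critic is a fixed quadratic stand-in (not a learned network), and the loss follows the familiar DQL-style structure of a behavior-cloning/flow-matching regression term minus an `alpha`-weighted Q term, here applied to a predicted average velocity.

```python
import numpy as np

rng = np.random.default_rng(1)

def q_value(state, action):
    """Stand-in critic: a fixed quadratic bowl (illustrative, not a learned Q-network)."""
    return -float(np.sum((action - 0.5) ** 2))

def joint_loss(pred_avg_velocity, target_velocity, state, generated_action, alpha=1.0):
    """Behavior-regularized objective: flow-matching regression on the velocity field
    plus Q-maximization on the generated action (the 'BC term minus alpha * Q' shape)."""
    fm_loss = float(np.mean((pred_avg_velocity - target_velocity) ** 2))
    return fm_loss - alpha * q_value(state, generated_action)

# Toy usage with random vectors standing in for network outputs.
s = rng.standard_normal(4)       # state
a = rng.standard_normal(2)       # action produced by the one-step policy
u_pred = rng.standard_normal(2)  # predicted average velocity
u_tgt = rng.standard_normal(2)   # flow-matching regression target
loss = joint_loss(u_pred, u_tgt, s, a)
print(np.isfinite(loss))  # True
```

Because the single-step sampler is just one differentiable network call, the Q term backpropagates into the velocity field directly, with no recursive gradients through a denoising chain.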
📝 Abstract
The generative power of diffusion models (DMs) has recently enabled high-performing decision-making algorithms in offline reinforcement learning (RL), achieving state-of-the-art results across standard benchmarks. Among them, Diffusion Q-Learning (DQL) stands out as a leading method for its consistently strong performance. Nevertheless, DQL remains limited in practice due to its reliance on multi-step denoising for action generation during both training and inference. Although one-step denoising is desirable, simply applying it to DQL leads to a drastic performance drop. In this work, we revisit DQL and identify its core limitations. We then propose One-Step Flow Q-Learning (OFQL), a novel framework that enables efficient one-step action generation during both training and inference, without requiring auxiliary models, distillation, or multi-phase training. Specifically, OFQL reformulates DQL within the sample-efficient Flow Matching (FM) framework. While conventional FM induces curved generative trajectories that impede one-step generation, OFQL instead learns an average velocity field that facilitates direct, accurate action generation. Collectively, OFQL eliminates the need for multi-step sampling and recursive gradient updates in DQL, resulting in faster and more robust training and inference. Extensive experiments on the D4RL benchmark demonstrate that OFQL outperforms DQL and other diffusion-based baselines, while substantially reducing both training and inference time compared to DQL.
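The contrast between multi-step denoising and average-velocity one-step sampling can be sketched in a few lines. This is a minimal numpy illustration, not the paper's architecture: the "networks" are a shared random linear map, and `avg_velocity`'s parameterization is a placeholder. The point is structural: a conventional flow policy integrates an instantaneous velocity field with K Euler steps, while a learned average velocity over [0, 1] yields the action in a single call via a_1 = a_0 + u_avg(a_0, s).

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 4, 2

# Toy linear map standing in for a learned velocity network (illustrative only).
W = rng.normal(scale=0.1, size=(STATE_DIM + ACTION_DIM + 1, ACTION_DIM))

def velocity(state, a_t, t):
    """Instantaneous velocity v(a_t, t | s), as in conventional flow matching."""
    return np.concatenate([state, a_t, [t]]) @ W

def sample_multistep(state, k=10):
    """K-step Euler integration from noise (t=0) to action (t=1): K network calls."""
    a = rng.standard_normal(ACTION_DIM)
    t, dt = 0.0, 1.0 / k
    for _ in range(k):
        a = a + dt * velocity(state, a, t)
        t += dt
    return a

def avg_velocity(state, a0):
    """Average velocity over [0, 1]; learning this field directly makes one call suffice."""
    return np.concatenate([state, a0, [0.5]]) @ W  # placeholder parameterization

def sample_one_step(state):
    """One-step generation: a_1 = a_0 + u_avg(a_0, s), a single network call."""
    a0 = rng.standard_normal(ACTION_DIM)
    return a0 + avg_velocity(state, a0)

s = rng.standard_normal(STATE_DIM)
print(sample_one_step(s).shape)  # (2,)
```

Naively setting K=1 in the multi-step sampler amounts to one Euler step along a curved trajectory, which is why it degrades sharply; the average-velocity field absorbs the full displacement from noise to action, so the single step is exact by construction.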