🤖 AI Summary
This work addresses the high latency and inefficiency of score- and flow-based generative models in reinforcement learning, which stem from iterative sampling. To overcome these limitations, the authors propose Moment Matching Q-Learning (MoMa QL), a novel framework that introduces Maximum Mean Discrepancy (MMD) into Q-learning for the first time. By matching all-order moments between source and target distributions, MoMa QL imposes strong regularization on the conditional score function, ensuring distributional convergence and substantially improving sampling efficiency. The method achieves performance on the D4RL benchmark that is comparable to or better than state-of-the-art approaches while significantly reducing computational overhead. Furthermore, MoMa QL demonstrates faster convergence and greater adaptability during fine-tuning in offline-to-online reinforcement learning settings.
📝 Abstract
Score-based and flow-based generative models exhibit remarkable expressive capacity in capturing complex distributions and have been extensively deployed in tasks ranging from image generation to reinforcement learning (RL). Nevertheless, these models rely on iterative sampling and therefore suffer from prolonged inference latency, which imposes a significant computational bottleneck in RL. To overcome this limitation, we propose a new framework named Moment Matching Q-Learning (MoMa QL), which utilizes maximum mean discrepancy (MMD), a technique from statistical hypothesis testing that aims to match all orders of statistics between the original and target distributions. By enforcing strong regularization on all moment statistics, this algorithm guarantees distribution-level convergence for the conditional score function and remains stable under various hyperparameters. Empirically, we show that MoMa QL is more computationally efficient while achieving comparable, if not superior, performance on various D4RL tasks. Remarkably, by accelerating the action sampling process for flow-based policies, MoMa QL demonstrates superior performance in offline-to-online RL tasks, owing to faster and stronger adaptability during online interactive fine-tuning.
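To make the MMD quantity mentioned in the abstract concrete, the following is a minimal sketch of a sample-based squared-MMD estimate. It is not the authors' implementation: the Gaussian (RBF) kernel, the biased V-statistic estimator, the bandwidth value, and the names `gaussian_kernel` and `mmd_squared` are all assumptions chosen for illustration. The Gaussian kernel is a common choice because it is characteristic, so the population MMD is zero only when the two distributions coincide.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)).

    The kernel and bandwidth are illustrative assumptions, not taken from the paper.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)],
    with each expectation replaced by an average over all sample pairs.
    """
    n, m = len(xs), len(ys)
    k_xx = sum(gaussian_kernel(a, b, bandwidth) for a in xs for b in xs) / (n * n)
    k_yy = sum(gaussian_kernel(a, b, bandwidth) for a in ys for b in ys) / (m * m)
    k_xy = sum(gaussian_kernel(a, b, bandwidth) for a in xs for b in ys) / (n * m)
    return k_xx + k_yy - 2.0 * k_xy


# Toy 2-D "action" samples (hypothetical data, seeded for reproducibility).
rng = random.Random(0)
same_a = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
same_b = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
shifted = [(rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)) for _ in range(200)]

mmd_same = mmd_squared(same_a, same_b)      # two samples from the same distribution
mmd_shifted = mmd_squared(same_a, shifted)  # samples from a mean-shifted distribution
```

In this toy setup, `mmd_same` should be close to zero and `mmd_shifted` noticeably larger. Used as a training loss, a quantity of this kind would penalize a policy whose sampled actions differ in distribution from the target actions, without requiring an iterative sampler to evaluate it.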