🤖 AI Summary
To address the sample inefficiency and memory overhead that target networks and large-capacity replay buffers introduce into deep off-policy TD learning, and the instability that arises when they are naively removed, this paper proposes PQN, an online Q-learning algorithm that eliminates both components. The authors give the first theoretical proof that normalization techniques (e.g., LayerNorm) can ensure convergence of off-policy TD methods without a target network or replay buffer. By combining vectorized parallel sampling with a lightweight online Q-learning architecture, PQN achieves stable and efficient end-to-end training. On standard benchmarks, including Atari, Craftax, and SMAX, PQN matches or exceeds state-of-the-art methods such as Rainbow, PPO-RNN, and QMix, while training up to 50× faster than conventional DQN and maintaining competitive sample efficiency. This work revitalizes the Q-learning paradigm, demonstrating that simple, stable, and scalable off-policy deep RL is attainable without architectural crutches like target networks or replay buffers, and offering a principled new pathway for off-policy deep reinforcement learning.
📝 Abstract
Q-learning played a foundational role in the field of reinforcement learning (RL). However, TD algorithms with off-policy data, such as Q-learning, or with nonlinear function approximation like deep neural networks, require several additional tricks to stabilise training, primarily a large replay buffer and target networks. Unfortunately, the delayed updating of frozen network parameters in the target network harms sample efficiency, and similarly, the large replay buffer introduces memory and implementation overheads. In this paper, we investigate whether it is possible to accelerate and simplify off-policy TD training while maintaining its stability. Our key theoretical result demonstrates for the first time that regularisation techniques such as LayerNorm can yield provably convergent TD algorithms without the need for a target network or replay buffer, even with off-policy data. Empirically, we find that online, parallelised sampling enabled by vectorised environments stabilises training without the need for a large replay buffer. Motivated by these findings, we propose PQN, our simplified deep online Q-learning algorithm. Surprisingly, this simple algorithm is competitive with more complex methods such as Rainbow in Atari, PPO-RNN in Craftax, and QMix in SMAX, and can be up to 50x faster than traditional DQN without sacrificing sample efficiency. In an era where PPO has become the go-to RL algorithm, PQN re-establishes off-policy Q-learning as a viable alternative.
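The core idea, stripping Q-learning of the target network and replay buffer while relying on normalised features and batched transitions from parallel environments for stability, can be sketched as a toy semi-gradient update. This is a minimal illustration with a hypothetical linear Q-head over LayerNormed observations, not the paper's actual deep network or training loop:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each feature vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def td_update(W, obs, act, rew, next_obs, done, gamma=0.99, lr=0.1):
    """One semi-gradient Q-learning step on a batch of transitions.

    No target network: the bootstrap target is computed with the same
    weights W, relying on normalised features for stability. `done` is
    a float array (1.0 for terminal transitions).
    """
    phi = layer_norm(obs)                        # (B, n_obs)
    q_sa = (phi @ W)[np.arange(len(act)), act]   # Q(s, a), shape (B,)
    q_next = layer_norm(next_obs) @ W            # (B, n_act)
    target = rew + gamma * (1.0 - done) * q_next.max(axis=1)
    td_err = target - q_sa
    # Semi-gradient: differentiate only through q_sa, not the target.
    for a in range(W.shape[1]):
        mask = act == a
        W[:, a] += lr * (phi[mask].T @ td_err[mask]) / len(act)
    return td_err

# Toy demo: repeatedly fit a single terminal transition with reward 1.
rng = np.random.default_rng(0)
W = np.zeros((4, 2))                   # linear Q-head: 4 obs dims, 2 actions
obs = rng.normal(size=(1, 4))          # a "batch" of one vectorised env
act, rew, done = np.array([0]), np.array([1.0]), np.array([1.0])
for _ in range(100):
    td_update(W, obs, act, rew, obs, done)
q_final = (layer_norm(obs) @ W)[0, 0]  # converges towards the reward of 1
```

In PQN itself the linear head is a deep network with normalisation layers, trained on batches gathered online from many parallel environments; the sketch only illustrates how the bootstrap target can reuse the current weights once the features are normalised.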