AI Summary
This study addresses the challenge of mastering Go through model-free reinforcement learning, without relying on Monte Carlo tree search (MCTS) during training or on human data. To this end, the authors propose QZero, an algorithm that integrates entropy-regularized Q-learning, self-play, and off-policy experience replay, enabling a single Q-network to learn a Nash equilibrium policy from scratch. The authors present this as the first demonstration that model-free methods can efficiently master Go, a large-scale, complex two-player game, and as evidence that off-policy learning extends to high-dimensional policy spaces. Trained for five months on only seven GPUs, QZero reaches performance comparable to AlphaGo, which the authors take as a sign of the approach's training efficiency.
Abstract
The game of Go has long served as a benchmark for artificial intelligence, demanding sophisticated strategic reasoning and long-term planning. Previous approaches, such as AlphaGo and its successors, have predominantly relied on model-based Monte Carlo Tree Search (MCTS). In this work, we present QZero, a novel model-free reinforcement learning algorithm that forgoes search during training and learns a Nash equilibrium policy through self-play and off-policy experience replay. Built upon entropy-regularized Q-learning, QZero utilizes a single Q-value network to unify policy evaluation and improvement. Starting tabula rasa without human data and trained for 5 months with modest compute resources (7 GPUs), QZero achieved a performance level comparable to that of AlphaGo. This demonstrates, for the first time, the efficiency of using model-free reinforcement learning to master the game of Go, as well as the feasibility of off-policy reinforcement learning in solving large-scale and complex environments.
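To illustrate how a single set of Q-values can serve both policy evaluation and policy improvement under entropy regularization, here is a minimal sketch of the generic soft Q-learning quantities. It is not QZero's actual update rule, which the abstract does not specify. The function names, the temperature `tau`, the discount `gamma`, and the negamax sign convention for the opponent's turn are all assumptions made for illustration.

```python
import math


def soft_value(q_values, tau):
    """Entropy-regularized state value: tau * log(sum(exp(q / tau)))."""
    m = max(q_values)  # subtract the max for numerical stability
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))


def soft_policy(q_values, tau):
    """Boltzmann policy implied by the same Q-values: softmax(q / tau)."""
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def soft_q_target(reward, next_q_values, tau, gamma=1.0, terminal=False):
    """One-step target for Q(s, a) in an alternating-move zero-sum game.

    Assumption (not from the paper): the next state's Q-values are from the
    opponent's point of view, so its soft value is negated (negamax).
    """
    if terminal:
        return reward
    return reward - gamma * soft_value(next_q_values, tau)


# Hypothetical Q-values for three legal moves in some position.
q = [0.2, -0.1, 0.5]
policy = soft_policy(q, tau=0.1)            # improvement: act from Q directly
target = soft_q_target(0.0, q, tau=0.1)     # evaluation: bootstrap from Q
```

In this sketch the policy is read directly off the Q-values, so no separate policy network or search is needed to pick a move. Lowering `tau` makes the policy greedier and moves the soft value toward the plain maximum.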