Mastering the Game of Go with Self-play Experience Replay

πŸ“… 2026-01-06
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study addresses the challenge of mastering Go through model-free reinforcement learning, without relying on Monte Carlo tree search (MCTS) or human data. To this end, the authors propose QZero, an algorithm that integrates entropy-regularized Q-learning, self-play, and off-policy experience replay, enabling a single Q-network to learn a Nash equilibrium strategy from scratch. This work demonstrates for the first time that model-free methods can efficiently master Goβ€”a large-scale, complex two-player gameβ€”and successfully extends off-policy learning to high-dimensional policy spaces. Experimental results show that QZero achieves performance comparable to AlphaGo using only seven GPUs over five months of training, significantly improving both training efficiency and scalability.
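To make the summary concrete, here is a minimal sketch of the combination it describes: entropy-regularized (soft) Q-learning with an off-policy replay buffer. This is a tabular toy on a random 5-state MDP with hypothetical hyperparameters, not the paper's deep Q-network, Go environment, or training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 4      # toy action space (Go has up to 362 legal moves)
TAU = 0.1          # entropy-regularization temperature (hypothetical value)
GAMMA = 1.0        # undiscounted episodic setting
LR = 0.5           # tabular learning rate

# Tabular Q-values for a 5-state toy problem; QZero uses a single deep Q-network.
Q = np.zeros((5, N_ACTIONS))

def soft_policy(q_row, tau=TAU):
    """Boltzmann policy induced by entropy-regularized Q-values."""
    z = (q_row - q_row.max()) / tau          # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def soft_value(q_row, tau=TAU):
    """Soft state value: V(s) = tau * log sum_a exp(Q(s,a)/tau)."""
    m = q_row.max()
    return m + tau * np.log(np.exp((q_row - m) / tau).sum())

replay = []  # off-policy experience replay buffer of (s, a, r, s', done)

for _ in range(2000):
    # Act with the current soft policy (self-play stands in for this step
    # in the two-player setting) and store the transition.
    s = rng.integers(5)
    a = rng.choice(N_ACTIONS, p=soft_policy(Q[s]))
    s_next = rng.integers(5)
    done = rng.random() < 0.2
    r = 1.0 if (done and s_next == 0) else 0.0
    replay.append((s, a, r, s_next, done))

    # Off-policy update: sample a past transition, regress toward the
    # soft Bellman target r + gamma * V_soft(s').
    s_, a_, r_, sn_, d_ = replay[rng.integers(len(replay))]
    target = r_ + (0.0 if d_ else GAMMA * soft_value(Q[sn_]))
    Q[s_, a_] += LR * (target - Q[s_, a_])
```

The soft value's log-sum-exp form is what lets a single Q-table (or Q-network) serve both policy evaluation and improvement: the Boltzmann policy is read off the same Q-values the target bootstraps from.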

πŸ“ Abstract
The game of Go has long served as a benchmark for artificial intelligence, demanding sophisticated strategic reasoning and long-term planning. Previous approaches, such as AlphaGo and its successors, have predominantly relied on model-based Monte-Carlo Tree Search (MCTS). In this work, we present QZero, a novel model-free reinforcement learning algorithm that forgoes search during training and learns a Nash equilibrium policy through self-play and off-policy experience replay. Built upon entropy-regularized Q-learning, QZero utilizes a single Q-value network to unify policy evaluation and improvement. Starting tabula rasa without human data and trained for 5 months on modest compute resources (7 GPUs), QZero achieved a performance level comparable to that of AlphaGo. This demonstrates, for the first time, the efficiency of using model-free reinforcement learning to master the game of Go, as well as the feasibility of off-policy reinforcement learning in solving large-scale and complex environments.
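For readers unfamiliar with the term, entropy-regularized Q-learning is standardly defined by a soft Bellman backup in which the state value is a temperature-scaled log-sum-exp of the Q-values; the formulation below is the common textbook form under that assumption, not necessarily QZero's exact loss:

```latex
V(s) = \tau \log \sum_{a} \exp\!\big(Q(s,a)/\tau\big), \qquad
Q(s,a) \leftarrow r(s,a) + \gamma\, \mathbb{E}_{s'}\!\big[V(s')\big], \qquad
\pi(a \mid s) = \frac{\exp\!\big(Q(s,a)/\tau\big)}{\sum_{a'} \exp\!\big(Q(s,a')/\tau\big)}
```

Because the policy $\pi$ is induced directly from $Q$, a single Q-value network suffices for both policy evaluation and improvement, as the abstract states.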
Problem

Research questions and friction points this paper is trying to address.

Go
model-free reinforcement learning
self-play
off-policy reinforcement learning
Nash equilibrium
Innovation

Methods, ideas, or system contributions that make the work stand out.

model-free reinforcement learning
self-play
off-policy experience replay
entropy-regularized Q-learning
Go AI