Mastering the Game of Go with Self-play Experience Replay

📅 2026-01-06

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study asks whether Go can be mastered with model-free reinforcement learning, without Monte Carlo tree search (MCTS) during training and without human data. The authors propose QZero, an algorithm that combines entropy-regularized Q-learning, self-play, and off-policy experience replay so that a single Q-value network learns a Nash equilibrium policy from scratch. Trained for five months on seven GPUs, QZero reaches a performance level the authors describe as comparable to that of AlphaGo. They present this as the first demonstration that model-free reinforcement learning can master Go efficiently, and as evidence that off-policy reinforcement learning is feasible in large-scale, complex environments.

📝 Abstract

The game of Go has long served as a benchmark for artificial intelligence, demanding sophisticated strategic reasoning and long-term planning. Previous approaches, such as AlphaGo and its successors, have predominantly relied on model-based Monte-Carlo Tree Search (MCTS). In this work, we present QZero, a novel model-free reinforcement learning algorithm that forgoes search during training and learns a Nash equilibrium policy through self-play and off-policy experience replay. Built upon entropy-regularized Q-learning, QZero utilizes a single Q-value network to unify policy evaluation and improvement. Starting tabula rasa without human data and trained for 5 months with modest compute resources (7 GPUs), QZero achieved a performance level comparable to that of AlphaGo. This demonstrates, for the first time, the efficiency of using model-free reinforcement learning to master the game of Go, as well as the feasibility of off-policy reinforcement learning in solving large-scale and complex environments.
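
The abstract does not spell out QZero's update rule, so the following is only a generic sketch of how entropy-regularized (soft) Q-learning lets one set of Q-values serve for both evaluation and improvement. Everything here is an assumption made for illustration: the function names, the temperature `tau`, the discount `gamma`, and the sign flip for an alternating-move zero-sum game are standard textbook choices, not details taken from the paper.

```python
import math

def soft_value(q_values, tau):
    """Soft state value: tau * log(sum(exp(q / tau))) over legal moves."""
    m = max(q_values)  # subtract the max for numerical stability
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))

def softmax_policy(q_values, tau):
    """Boltzmann policy with pi(a) proportional to exp(Q(a) / tau)."""
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def soft_q_target(reward, next_q_values, tau, gamma=1.0, terminal=False):
    """One-step regression target for Q(s, a), seen from the player who moved.

    Assumption: players alternate and the game is zero-sum, so the soft
    value of the next state belongs to the opponent and is subtracted.
    """
    if terminal:
        return reward
    return reward - gamma * soft_value(next_q_values, tau)
```

In this sketch the policy is read directly off the Q-values via `softmax_policy`, and the same Q-values produce the bootstrap target via `soft_value`, which is one way a single network can cover both roles. As `tau` shrinks, the soft value approaches the ordinary maximum and the policy approaches greedy play.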

Problem

Research questions and friction points this paper is trying to address.

model-free reinforcement learning

self-play

off-policy reinforcement learning

Nash equilibrium

Innovation

Methods, ideas, or system contributions that make the work stand out.

model-free reinforcement learning

self-play

off-policy experience replay

entropy-regularized Q-learning
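
Off-policy experience replay, listed above, can be illustrated with a minimal buffer sketch. It assumes a fixed-capacity first-in-first-out store with uniform sampling; the class name `ReplayBuffer`, the transition layout, and the sampling scheme are hypothetical and are not described in the source.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of self-play transitions (illustrative only)."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition once full.
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Record one transition generated by any past version of the policy."""
        self._items.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a uniform batch without replacement, capped at the buffer size."""
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)
```

Because sampled transitions may come from older versions of the policy, the learner that consumes them has to be off-policy, which is the property the keyword list attributes to QZero.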