Simultaneous AlphaZero: Extending Tree Search to Markov Games

📅 2025-12-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two-player, zero-sum, deterministic Markov games with synchronous actions. We propose the first general method extending the AlphaZero framework to this class of games. Our core innovation introduces a bandit-feedback-based regret-minimization matrix game solver at each Monte Carlo Tree Search (MCTS) node, jointly optimizing immediate rewards and value-network predictions to efficiently compute Nash-equilibrium joint actions. Crucially, we achieve end-to-end co-training of the value network, policy network, and online game solver—a novel capability not previously realized. We validate our approach on continuous-state, discrete-action pursuit–evasion and satellite-guardian tasks. The learned policies exhibit robust competitiveness against optimal adversaries, significantly improving both policy quality and convergence stability in synchronous adversarial settings.

Technology Category

Application Category

📝 Abstract
Simultaneous AlphaZero extends the AlphaZero framework to multistep, two-player zero-sum deterministic Markov games with simultaneous actions. At each decision point, joint action selection is resolved via matrix games whose payoffs incorporate both immediate rewards and future value estimates. To handle uncertainty arising from bandit feedback during Monte Carlo Tree Search (MCTS), Simultaneous AlphaZero incorporates a regret-optimal solver for matrix games with bandit feedback. Simultaneous AlphaZero demonstrates robust strategies in a continuous-state discrete-action pursuit-evasion game and satellite custody maintenance scenarios, even when evaluated against maximally exploitative opponents.
Problem

Research questions and friction points this paper is trying to address.

Extends AlphaZero to simultaneous-action Markov games
Resolves joint actions via matrix games with rewards
Handles bandit feedback uncertainty with regret-optimal solvers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends AlphaZero to simultaneous-action Markov games
Resolves joint actions via matrix games with future estimates
Uses regret-optimal solver for bandit feedback in MCTS
🔎 Similar Papers
No similar papers found.