Zero-sum turn games using Q-learning: finite computation with security guarantees

📅 2025-09-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of security guarantees when Q-learning for zero-sum turn games is terminated early, before the full state space has been explored. It shows that pure saddle-point state-feedback policies can be constructed from a dynamic programming fixed-point equation for a single value function or Q-function, and that this fixed point can be computed with a suitable form of Q-learning. It introduces an "opponent-informed" exploration policy which ensures that the security levels implied by the final Q-function hold, at least, against a given set of opponent policies. Both discounted and undiscounted costs are covered: convergence for discounted costs follows from classical techniques, and a separate convergence result is given for undiscounted finite-time deterministic games. A numerical demonstration on the multi-agent game Atlatl indicates the effectiveness of the method.

📝 Abstract

This paper addresses zero-sum "turn" games, in which only one player can make decisions at each state. We show that pure saddle-point state-feedback policies for turn games can be constructed from dynamic programming fixed-point equations for a single value function or Q-function. These fixed points can be constructed using a suitable form of Q-learning. For discounted costs, convergence of this form of Q-learning can be established using classical techniques. For undiscounted costs, we provide a convergence result that applies to finite-time deterministic games, which we use to illustrate our results. For complex games, the Q-learning iteration must be terminated before exploring the full state space, which can lead to policies that cannot guarantee the security levels implied by the final Q-function. To mitigate this, we propose an "opponent-informed" exploration policy for selecting the Q-learning samples. This form of exploration can guarantee that the final Q-function provides security levels that hold, at least, against a given set of policies. A numerical demonstration for a multi-agent game, Atlatl, indicates the effectiveness of these methods.
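
To make the single-Q-function idea in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a toy undiscounted, finite-time, deterministic turn game (a subtraction game in which players alternately remove 1 or 2 stones and whoever takes the last stone wins). All names (`actions`, `step`, `value`, `q_learning`, `greedy_policy`) are hypothetical. Because the dynamics are deterministic, the sketch uses a learning rate of 1 (direct assignment) and systematic sweeps over all state-action pairs in place of a sampled exploration policy.

```python
from itertools import product

MIN, MAX = 0, 1  # player indices: MIN minimizes the cost, MAX maximizes it


def actions(state):
    """Legal moves: remove 1 or 2 stones, never more than remain."""
    stones, _ = state
    return [a for a in (1, 2) if a <= stones]


def step(state, action):
    """Deterministic transition: returns (next_state, stage_cost)."""
    stones, player = state
    remaining = stones - action
    if remaining == 0:
        # Whoever takes the last stone wins; cost is from MIN's point of view.
        cost = -1.0 if player == MIN else 1.0
    else:
        cost = 0.0
    return (remaining, 1 - player), cost


def value(Q, state):
    """Value implied by Q: min or max over actions, depending on whose turn it is."""
    acts = actions(state)
    if not acts:  # terminal state (no stones left)
        return 0.0
    pick = min if state[1] == MIN else max
    return pick(Q[(state, a)] for a in acts)


def q_learning(max_stones, sweeps):
    """Undiscounted Q-iteration with a single Q-function shared by both players."""
    states = list(product(range(1, max_stones + 1), (MIN, MAX)))
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    for _ in range(sweeps):
        for (s, a) in list(Q):
            s_next, cost = step(s, a)
            # Learning rate 1 is an assumption that suits deterministic dynamics.
            Q[(s, a)] = cost + value(Q, s_next)
    return Q


def greedy_policy(Q, state):
    """Pure state-feedback policy: the mover's best action under Q."""
    pick = min if state[1] == MIN else max
    return pick(actions(state), key=lambda a: Q[(state, a)])
```

For example, `Q = q_learning(7, 7)` followed by `greedy_policy(Q, (4, MIN))` gives the minimizer's move with 4 stones left. In this toy game every state is visited, so the early-termination issue the abstract describes does not arise; the sketch only illustrates how one Q-function with a state-dependent min or max yields policies for both players.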

Problem

Research questions and friction points this paper is trying to address.

Constructing saddle-point policies via Q-learning in zero-sum turn games

Ensuring security guarantees with opponent-informed exploration strategies

Analyzing convergence for both discounted and undiscounted cost scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Q-learning for zero-sum turn games

Opponent-informed exploration policy

Dynamic programming fixed-point equations
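
As a rough illustration of what a fixed-point equation for a single Q-function can look like in a deterministic turn game, one plausible form is written out below. The notation is assumed for this sketch and is not taken from the paper: f is the transition map, c the stage cost, γ the discount factor (γ = 1 for the undiscounted case), and the state space is split into states where the minimizer decides and states where the maximizer decides. The `cases` environment requires `amsmath`.

```latex
% Assumed notation (hypothetical, for illustration only):
%   f(s,a)  deterministic next state,  c(s,a)  stage cost,
%   \gamma \in (0,1]  discount factor (\gamma = 1: undiscounted),
%   \mathcal{S}_{\min}, \mathcal{S}_{\max}  states where the minimizer / maximizer moves.
\[
Q(s,a) = c(s,a) + \gamma\, V\bigl(f(s,a)\bigr), \qquad
V(s) =
\begin{cases}
\min_{a \in \mathcal{A}(s)} Q(s,a), & s \in \mathcal{S}_{\min},\\
\max_{a \in \mathcal{A}(s)} Q(s,a), & s \in \mathcal{S}_{\max},\\
0, & s \text{ terminal.}
\end{cases}
\]
```

Because only one player acts at each state, the min and max never appear together at the same state, which is why a single Q-function can serve both players.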

🔎 Similar Papers

Learning in Zero-Sum Markov Games: Relaxing Strong Reachability and Mixing Time Assumptions