🤖 AI Summary
This paper addresses the model-free learning of Nash equilibria in zero-sum linear-quadratic (LQ) games with unknown system dynamics. We propose a nested zeroth-order natural policy gradient algorithm that accommodates both single-point and two-point gradient estimators, achieving, for the first time, sample complexities of $\tilde{\mathcal{O}}(\varepsilon^{-2})$ (two-point) and $\tilde{\mathcal{O}}(\varepsilon^{-3})$ (single-point), substantially improving upon existing polynomial upper bounds. The algorithm retains the implicit regularization property that preserves controller robustness during learning, and it is theoretically guaranteed to converge to an $\varepsilon$-neighborhood of the Nash equilibrium. Key contributions are: (1) a lightweight nested iterative architecture that avoids costly high-dimensional Hessian estimation; (2) a tight analytical framework coupling zeroth-order optimization with LQ game policy gradients; and (3) state-of-the-art sample efficiency attained without system identification or explicit model knowledge.
📝 Abstract
Zero-sum Linear Quadratic (LQ) games are fundamental in optimal control and can be used (i) as a dynamic game formulation for risk-sensitive or robust control, or (ii) as a benchmark setting for multi-agent reinforcement learning with two competing agents in continuous state-control spaces. In contrast to the well-studied single-agent linear quadratic regulator problem, zero-sum LQ games entail solving a challenging nonconvex-nonconcave min-max problem with an objective function that lacks coercivity. Recently, Zhang et al. [1] discovered an implicit regularization property of natural policy gradient methods which is crucial for safety-critical control systems since it preserves the robustness of the controller during learning. Moreover, in the model-free setting where the knowledge of model parameters is not available, Zhang et al. proposed the first polynomial sample complexity algorithm to reach an $\epsilon$-neighborhood of the Nash equilibrium while maintaining the desirable implicit regularization property. In this work, we propose a simpler nested Zeroth-Order (ZO) algorithm improving sample complexity by several orders of magnitude. Our main result guarantees a $\tilde{\mathcal{O}}(\epsilon^{-3})$ sample complexity under the same assumptions using a single-point ZO estimator. Furthermore, when the estimator is replaced by a two-point estimator, our method enjoys a better $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity. Our key improvements rely on a more sample-efficient nested algorithm design and finer control of the ZO natural gradient estimation error.
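To make the single-point versus two-point distinction concrete, the following is a minimal sketch of the two standard sphere-smoothing zeroth-order gradient estimators, applied to a toy quadratic cost. It is not the paper's nested natural policy gradient algorithm: the abstract does not give the exact estimator form, so the formulas, the smoothing radius `r`, the function `toy_cost`, and all function names here are illustrative assumptions.

```python
import math
import random


def sample_unit_sphere(dim, rng):
    """Draw a direction uniformly from the unit sphere in R^dim."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def single_point_estimate(f, x, r, rng):
    """One function query per sample: (d / r) * f(x + r u) * u."""
    d = len(x)
    u = sample_unit_sphere(d, rng)
    value = f([xi + r * ui for xi, ui in zip(x, u)])
    return [(d / r) * value * ui for ui in u]


def two_point_estimate(f, x, r, rng):
    """Two queries per sample: (d / (2 r)) * (f(x + r u) - f(x - r u)) * u."""
    d = len(x)
    u = sample_unit_sphere(d, rng)
    plus = f([xi + r * ui for xi, ui in zip(x, u)])
    minus = f([xi - r * ui for xi, ui in zip(x, u)])
    return [(d / (2.0 * r)) * (plus - minus) * ui for ui in u]


def average_estimate(estimator, f, x, r, n_samples, seed=0):
    """Average n_samples independent estimates to reduce variance."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        g = estimator(f, x, r, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n_samples for t in total]


def toy_cost(x):
    """Hypothetical stand-in cost: ||x||^2, whose true gradient is 2x."""
    return sum(xi * xi for xi in x)


if __name__ == "__main__":
    x0 = [1.0, -2.0]  # true gradient at x0 is [2.0, -4.0]
    for name, est in [("single-point", single_point_estimate),
                      ("two-point", two_point_estimate)]:
        print(name, average_estimate(est, toy_cost, x0, r=0.5, n_samples=20000))
```

On this toy cost both averages estimate the same gradient $2x$. The two-point estimate typically has much lower variance, because the difference $f(x + ru) - f(x - ru)$ cancels the large constant part of the cost that the single-point estimator multiplies by $d/r$. This is the usual intuition for why two-point estimators need fewer samples, in line with the $\epsilon^{-2}$ versus $\epsilon^{-3}$ rates stated above.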