Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of simultaneously guaranteeing low external regret against a fixed opponent and low Nash-value regret in the worst case, in two-player Markov games where the opponent's strategy is unknown and possibly non-stationary. To this end, the authors introduce a new performance metric, empirical Nash-value regret, and propose a parameter-free online learning algorithm that combines an adaptive restarting mechanism with a dynamically tuned epoch incremental factor. The algorithm automatically adapts to the opponent's non-stationarity, measured by the total variation $C$ of its policies and the number of policy switches $L$, achieving a unified regret bound of $O(\min\{\sqrt{K} + (CK)^{1/3}, \sqrt{LK}\})$ over $K$ episodes. The method recovers the optimal $O(\sqrt{K})$ external regret when the opponent is fixed and matches the best-known $O(K^{2/3})$ Nash-value regret in the worst case, thereby establishing the first theoretically optimal interpolation between these two regimes.
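
To make the high-level recipe concrete, below is a minimal, hypothetical sketch (not the authors' code) of the restart-with-growing-epochs pattern the summary describes: a base learner runs in epochs whose lengths grow by an incremental factor $\eta$, and is restarted when a simple drift test fires. `BaseLearner`, `nonstationarity` detection via epoch-mean comparison, and the `threshold` parameter are all illustrative assumptions; the paper's actual base learner is an epoch-based V-learning algorithm and its restart test is more refined.

```python
# Hedged sketch of an adaptive-restart wrapper around an epoch-based learner.
# All components here are illustrative stand-ins, not the paper's algorithm.
import math
import random


class BaseLearner:
    """Stand-in for an epoch-based V-learning-style base algorithm."""

    def reset(self):
        self.estimate = 0.0
        self.count = 0

    def act(self):
        # Placeholder decision rule; a real learner would play a policy.
        return self.estimate

    def update(self, reward):
        self.count += 1
        self.estimate += (reward - self.estimate) / self.count


def run_with_adaptive_restarts(env_reward, K, eta=2.0, threshold=0.5):
    """Run K episodes; restart the base learner when the mean reward
    between consecutive epochs shifts by more than `threshold`.

    `eta` plays the role of the epoch incremental factor: epoch m has
    length on the order of eta**m, so a larger eta means fewer, longer
    epochs (cheaper restarts, but slower reaction to drift)."""
    learner = BaseLearner()
    learner.reset()
    k, epoch_len, prev_mean = 0, 1, None
    while k < K:
        rewards = []
        for _ in range(min(epoch_len, K - k)):
            learner.act()
            r = env_reward(k)          # reward revealed by the environment
            learner.update(r)
            rewards.append(r)
            k += 1
        mean = sum(rewards) / len(rewards)
        if prev_mean is not None and abs(mean - prev_mean) > threshold:
            learner.reset()            # detected drift: restart from scratch
            epoch_len, prev_mean = 1, None
        else:
            prev_mean = mean
            epoch_len = math.ceil(eta * epoch_len)
    return learner.estimate


if __name__ == "__main__":
    random.seed(0)
    # Opponent switches behavior halfway through: one "policy switch" (L = 1).
    drift = lambda k: (0.2 if k < 500 else 0.8) + random.uniform(-0.1, 0.1)
    print(run_with_adaptive_restarts(drift, K=1000))
```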

📝 Abstract
We study online learning in two-player uninformed Markov games, where the opponent's actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving no-external-regret is impossible without incurring an exponential dependence on the episode length $H$. They then turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret $O(K^{2/3})$ after $K$ episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even when the opponent follows a fixed policy, so that $O(\sqrt{K})$ external regret is well known to be achievable, their result remains the worse $O(K^{2/3})$ rate, and only on the weaker metric. In this work, we fully address both limitations. First, we introduce empirical Nash-value regret, a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new metric, we propose a parameter-free algorithm that achieves an $O(\min\{\sqrt{K} + (CK)^{1/3}, \sqrt{LK}\})$ regret bound, where $C$ quantifies the variation of the opponent's policies and $L$ denotes the number of policy switches (both at most $O(K)$). Therefore, our results not only recover the two extremes -- $O(\sqrt{K})$ external regret when the opponent is fixed and $O(K^{2/3})$ Nash-value regret in the worst case -- but also smoothly interpolate between them by automatically adapting to the opponent's non-stationarity. We achieve this by first providing a new analysis of the epoch-based V-learning algorithm of Mao et al. (2022), establishing an $O(\eta C + \sqrt{K/\eta})$ regret bound, where $\eta$ is the epoch incremental factor. We then show how to adaptively restart this algorithm with an appropriate $\eta$ in response to the potential non-stationarity of the opponent, eventually achieving our final results.
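
As a quick sanity check on how the intermediate bound yields the final rate (our back-of-the-envelope calculation, not a derivation taken from the paper), optimizing $O(\eta C + \sqrt{K/\eta})$ over the epoch incremental factor $\eta$ gives

\[
\frac{d}{d\eta}\Big(\eta C + \sqrt{K}\,\eta^{-1/2}\Big) = C - \tfrac{1}{2}\sqrt{K}\,\eta^{-3/2} = 0 \quad\Longrightarrow\quad \eta^{\star} = \Theta\big((K/C^{2})^{1/3}\big),
\]

and substituting $\eta^{\star}$ back yields $\eta^{\star} C + \sqrt{K/\eta^{\star}} = \Theta\big((CK)^{1/3}\big)$, matching the $(CK)^{1/3}$ term of the final bound; the additive $\sqrt{K}$ term dominates when $C = O(1)$, recovering the fixed-opponent rate. Since $C$ is unknown in advance, this tuning cannot be performed offline, which is precisely the gap the adaptive restarting scheme closes.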
Problem

Research questions and friction points this paper is trying to address.

Online Learning
Markov Games
Nash-value Regret
Non-Stationarity Adaptation
Uninformed Setting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Empirical Nash-value Regret
Non-Stationarity Adaptation
Parameter-Free Online Learning
Markov Games
Adaptive Restart