Bilevel Optimization over Saddle Points of Zero-Sum Markov Games

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies a bilevel reinforcement learning setting in which the lower level is a regularized min-max zero-sum Markov game, capturing competitive structures that arise in applications such as incentive design. The authors propose PANDA (penalty-augmented Nikaido-Isoda descent-ascent), a first-order penalty method built on the Nikaido-Isoda function, for bilevel problems whose upper-level objective is optimized through the saddle-point equilibrium of the lower-level game. PANDA avoids computing upper-level hypergradients, requires no second-order information, and does not rely on convexity assumptions for either level. Theoretically, it reaches an ε-stationary point within Õ(ε⁻¹) iterations with sample complexity Õ(ε⁻³), matching the best-known rates for bilevel RL with single-policy lower-level MDPs. Experiments show PANDA outperforming closely related baselines.
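
For orientation, the setting summarized above can be written in generic notation. The symbols below ($x$ for the upper-level parameters, $\pi$ and $\mu$ for the two lower-level policies, $F$ and $f_x$ for the upper- and lower-level objectives, $\lambda$ for a penalty weight) are assumptions made for this sketch and are not taken from the paper. The penalized objective shows the general shape of a Nikaido-Isoda penalty reformulation, which may differ from the exact objective PANDA uses.

```latex
% Bilevel problem over the saddle point of the lower-level game
% (assumes regularization makes the saddle point unique).
\min_{x} \; F\bigl(x, \pi^{*}(x), \mu^{*}(x)\bigr)
\quad \text{s.t.} \quad
\bigl(\pi^{*}(x), \mu^{*}(x)\bigr) \text{ is the saddle point of }
\min_{\pi} \max_{\mu} f_x(\pi, \mu).

% Nikaido-Isoda function of the two-player zero-sum game:
% nonnegative, and zero exactly at a saddle point.
\psi_x(\pi, \mu)
  = \max_{\mu'} f_x(\pi, \mu') - \min_{\pi'} f_x(\pi', \mu) \;\ge\; 0.

% Generic single-level penalty reformulation (illustrative only).
\min_{x, \pi, \mu} \; F(x, \pi, \mu) + \lambda \, \psi_x(\pi, \mu).
```

Because $\psi_x$ vanishes only at a saddle point, driving the penalty term to zero enforces the lower-level equilibrium constraint without differentiating through the equilibrium map $x \mapsto (\pi^{*}(x), \mu^{*}(x))$.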

📝 Abstract

Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-level (LL) decision-making process responds, naturally leading to a bilevel optimization problem. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP), and therefore fail to capture competitive structures arising in applications such as incentive design, where multiple policies interact. We study bilevel optimization problems in which the LL problem is a regularized min-max zero-sum Markov game and the UL objective is optimized through the saddle-point equilibrium induced by the LL game. In this work, we propose penalty-augmented Nikaido-Isoda descent-ascent (PANDA), a penalty-based first-order policy-gradient method based on the Nikaido-Isoda function. By exploiting the min-max game structure, PANDA avoids computing UL hypergradients and does not require second-order information. We prove that PANDA converges to stationary points without convexity assumptions on either the UL or LL objectives. Moreover, PANDA reaches an $\epsilon$-stationary point in $\tilde{\mathcal{O}}(\epsilon^{-1})$ iterations with sample complexity $\tilde{\mathcal{O}}(\epsilon^{-3})$, matching the best-known rates for bilevel RL with single-policy LL MDPs. Experiments demonstrate the superior performance of PANDA over closely related baselines.
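
To make the Nikaido-Isoda function concrete, here is a minimal sketch for a much simpler setting than the one in the paper: a single-state, entropy-regularized zero-sum matrix game. The payoff matrix `A`, the temperature `tau`, and the function names are hypothetical choices made for illustration, and this is not the PANDA algorithm. The sketch only evaluates the quantity that a penalty method of this kind would push toward zero.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def entropy(p):
    """Shannon entropy of a probability vector (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def ni_gap(A, p, q, tau):
    """Nikaido-Isoda function of the regularized matrix game

        f(p, q) = p^T A q - tau * H(p) + tau * H(q),

    where p (row player) minimizes and q (column player) maximizes.
    Returns max_{q'} f(p, q') - min_{p'} f(p', q), which is >= 0 and
    equals 0 exactly at the saddle point. Both inner problems have
    closed-form softmax solutions, so no inner optimization is needed.
    """
    n, m = len(A), len(A[0])
    At_p = [sum(A[i][j] * p[i] for i in range(n)) for j in range(m)]
    A_q = [sum(A[i][j] * q[j] for j in range(m)) for i in range(n)]
    # Best response value of the maximizer against p.
    best_max = -tau * entropy(p) + tau * logsumexp([v / tau for v in At_p])
    # Best response value of the minimizer against q.
    best_min = tau * entropy(q) - tau * logsumexp([-v / tau for v in A_q])
    return best_max - best_min


# Matching pennies: the uniform strategies form the saddle point.
A = [[1.0, -1.0], [-1.0, 1.0]]
gap_at_equilibrium = ni_gap(A, [0.5, 0.5], [0.5, 0.5], tau=0.5)
gap_off_equilibrium = ni_gap(A, [0.9, 0.1], [0.5, 0.5], tau=0.5)
```

In this toy game the gap vanishes (up to floating-point error) at the uniform strategies and is strictly positive when the row player deviates, which is the property that lets the Nikaido-Isoda function act as a penalty for violating the lower-level equilibrium.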

Problem

Research questions and friction points this paper is trying to address.

bilevel optimization

zero-sum Markov games

saddle points

reinforcement learning

min-max games

Innovation

Methods, ideas, or system contributions that make the work stand out.

bilevel optimization

zero-sum Markov games

Nikaido-Isoda function