Matching Multiple Experts: On the Exploitability of Multi-Agent Imitation Learning

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that offline multi-agent imitation learning fails to guarantee that learned policies are close to a Nash equilibrium, leaving them highly exploitable. The paper establishes impossibility and hardness results for learning low-exploitability strategies in general n-player Markov games, then shows how these obstacles can be overcome under assumptions of dominant strategies or best-response continuity. Combining behavioral cloning, measure matching, and Nash gap analysis, the authors prove that when expert demonstrations are drawn from a dominant-strategy equilibrium, the Nash imitation gap is bounded by O(nε_BC/(1−γ)²), where ε_BC is the behavioral cloning error and γ the discount factor. The result extends to a more general best-response continuity condition, which the authors argue is implicitly encouraged by standard regularization, yielding the first theoretical guarantees for learning low-exploitability policies in offline multi-agent imitation learning.
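As a rough numerical illustration (not taken from the paper), the stated bound O(nε_BC/(1−γ)²) can be evaluated directly, ignoring hidden constants; the function name and sample values below are hypothetical:

```python
def nash_gap_bound(n, eps_bc, gamma):
    """Nash imitation gap bound n * eps_BC / (1 - gamma)^2, up to constants."""
    return n * eps_bc / (1 - gamma) ** 2

# Example: 5 players, BC error 0.01, discount factor 0.99.
print(nash_gap_bound(5, 0.01, 0.99))  # → 500.0
```

Note the 1/(1−γ)² dependence: as γ → 1 (long effective horizon), even a small per-step cloning error can translate into a large exploitability bound.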

📝 Abstract
Multi-agent imitation learning (MA-IL) aims to learn optimal policies from expert demonstrations of interactions in multi-agent interactive domains. Despite existing guarantees on the performance of the resulting learned policies, characterizations of how far the learned policies are from a Nash equilibrium are missing for offline MA-IL. In this paper, we demonstrate impossibility and hardness results for learning low-exploitability policies in general $n$-player Markov Games. We do so by providing examples where even exact measure matching fails, and demonstrating a new hardness result on characterizing the Nash gap given a fixed measure matching error. We then show how these challenges can be overcome using strategic dominance assumptions on the expert equilibrium. Specifically, for the case of dominant strategy expert equilibria, assuming Behavioral Cloning error $\epsilon_{\text{BC}}$, this provides a Nash imitation gap of $\mathcal{O}\left(n\epsilon_{\text{BC}}/(1-\gamma)^2\right)$ for a discount factor $\gamma$. We generalize this result with a new notion of best-response continuity, and argue that this is implicitly encouraged by standard regularization techniques.
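To make the exploitability notion concrete, here is a minimal sketch (not from the paper) for a 2-player normal-form game, i.e. a one-state special case of a Markov game. Exploitability sums, over players, the gain each player could obtain by unilaterally best-responding; it is zero exactly at a Nash equilibrium. The function names and the prisoner's-dilemma payoffs are illustrative assumptions:

```python
import numpy as np

def exploitability(payoffs, strategies):
    """Sum of unilateral best-response gains for a 2-player game.

    payoffs[i]: payoff matrix for player i, indexed [a0, a1].
    strategies: (s0, s1), mixed strategies over each player's actions.
    """
    s0, s1 = strategies
    # Player 0: best-response value against s1 minus current value.
    v0 = s0 @ payoffs[0] @ s1
    gap = np.max(payoffs[0] @ s1) - v0
    # Player 1: best-response value against s0 minus current value.
    v1 = s0 @ payoffs[1] @ s1
    gap += np.max(payoffs[1].T @ s0) - v1
    return gap

# Prisoner's-dilemma-style game: action 1 ("defect") is dominant for both.
p0 = np.array([[3.0, 0.0], [5.0, 1.0]])
p1 = p0.T  # symmetric game
defect = np.array([0.0, 1.0])
print(exploitability([p0, p1], (defect, defect)))  # → 0.0 (dominant-strategy equilibrium)
```

An imperfectly cloned policy that puts mass on "cooperate" would have strictly positive exploitability; the paper's dominant-strategy assumption is what lets a small cloning error translate into a small Nash gap.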
Problem

Research questions and friction points this paper is trying to address.

multi-agent imitation learning
Nash equilibrium
exploitability
Markov games
measure matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent imitation learning
Nash equilibrium
exploitability
dominant strategy
best-response continuity