🤖 AI Summary
This paper addresses the weak theoretical foundations of multi-agent imitation learning (MAIL). First, it establishes the first statistical lower bound for non-interactive MAIL, identifying the all-policy deviation concentrability coefficient as the fundamental complexity measure and showing that Behavior Cloning (BC) is rate-optimal in this setting. Second, it proposes an interactive framework that combines reward-free reinforcement learning with interactive imitation learning, instantiated as the algorithm MAIL-WARM. Third, MAIL-WARM improves the best previously known sample complexity from 𝒪(ε⁻⁸) to 𝒪(ε⁻²), matching the ε-dependence of the derived lower bound. Numerical experiments support the theory and illustrate, in environments such as grid worlds, where Behavior Cloning fails to learn.
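To make the Behavior Cloning baseline concrete, here is a minimal tabular sketch, not taken from the paper: BC reduces imitation to supervised learning on expert state-action pairs, with no further environment interaction. The names `demos`, `n_states`, and `n_actions` are illustrative assumptions.

```python
import numpy as np

def behavior_cloning_tabular(demos, n_states, n_actions):
    """Tabular Behavior Cloning: estimate the expert policy as the
    empirical action distribution at each demonstrated state.

    demos: list of (state, action) pairs drawn from expert trajectories.
    Returns an (n_states, n_actions) row-stochastic policy matrix.
    """
    counts = np.zeros((n_states, n_actions))
    for s, a in demos:
        counts[s, a] += 1
    visited = counts.sum(axis=1, keepdims=True)
    # Uniform fallback on states the expert never visited -- exactly the
    # regime (e.g., grid-world states off the expert's path) where
    # non-interactive BC can fail and the interactive setting helps.
    policy = np.where(visited > 0,
                      counts / np.maximum(visited, 1),
                      1.0 / n_actions)
    return policy

# Example: two states demonstrated, one state never visited.
pi = behavior_cloning_tabular([(0, 1), (0, 1), (1, 0)],
                              n_states=3, n_actions=2)
```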
📝 Abstract
We close open theoretical gaps in Multi-Agent Imitation Learning (MAIL) by characterizing the limits of non-interactive MAIL and presenting the first interactive algorithm with near-optimal sample complexity. In the non-interactive setting, we prove a statistical lower bound that identifies the all-policy deviation concentrability coefficient as the fundamental complexity measure, and we show that Behavior Cloning (BC) is rate-optimal. For the interactive setting, we introduce a framework that combines reward-free reinforcement learning with interactive MAIL and instantiate it with an algorithm, MAIL-WARM. It improves the best previously known sample complexity from $\mathcal{O}(\varepsilon^{-8})$ to $\mathcal{O}(\varepsilon^{-2})$, matching the dependence on $\varepsilon$ implied by our lower bound. Finally, we provide numerical results that support our theory and illustrate, in environments such as grid worlds, where Behavior Cloning fails to learn.
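The abstract names only MAIL-WARM's two ingredients (reward-free reinforcement learning plus interactive MAIL), so the following is a hedged schematic of that combination, not the authors' algorithm: a DAgger-style interactive imitation loop warm-started from states gathered by a reward-free exploration phase. All callables (`query_expert`, `fit_policy`, `rollout_states`) and parameter names are hypothetical placeholders.

```python
def interactive_imitation(explore_states, query_expert, fit_policy,
                          rollout_states, n_rounds):
    """Hedged schematic only; the paper's MAIL-WARM may differ throughout.

    explore_states: states collected reward-free, before any expert queries.
    query_expert:   callable state -> expert action (interactive access).
    fit_policy:     callable dataset -> policy (supervised learner).
    rollout_states: callable policy -> states the learner itself visits.
    """
    # Warm start: label the reward-free exploration states.
    dataset = [(s, query_expert(s)) for s in explore_states]
    policy = fit_policy(dataset)
    for _ in range(n_rounds):
        # Label the learner's own visitation distribution, correcting the
        # covariate shift that defeats non-interactive Behavior Cloning.
        dataset += [(s, query_expert(s)) for s in rollout_states(policy)]
        policy = fit_policy(dataset)
    return policy
```

The design point this sketch illustrates is the division of labor suggested by the abstract: exploration is done without rewards, and expert queries are spent only on states the learner can actually reach.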