🤖 AI Summary
Existing inverse reinforcement learning (IRL) methods for general-sum, mixed-motive multi-agent games struggle to jointly model cooperative and competitive objectives, leading to biased reward inference and suboptimal policies. To address this, the authors propose MAMQL, a framework that connects optimal marginal Q-functions to Boltzmann policies, bridging multi-agent marginal Q-learning with single-agent soft Q-IRL. For each agent, MAMQL infers a latent reward function using a critic marginalized over the other agents' policies and a simple soft-Q optimization objective carried over from the single-agent setting. Evaluated across three diverse simulated environments, MAMQL consistently outperforms state-of-the-art baselines in average return, sample efficiency, and reward recovery accuracy, often by more than 2–5×, demonstrating its effectiveness in mixed-motive settings.
📝 Abstract
When reward functions are hand-designed, deep reinforcement learning algorithms often suffer from reward misspecification, causing them to learn policies that are suboptimal with respect to the intended task objectives. In the single-agent case, inverse reinforcement learning (IRL) techniques attempt to address this issue by inferring the reward function from expert demonstrations. In multi-agent problems, however, misalignment between the learned and true objectives is exacerbated by increased environment non-stationarity and variance that scales with the number of agents. As such, in multi-agent general-sum games, multi-agent IRL algorithms have difficulty balancing cooperative and competitive objectives. To address these issues, we propose Multi-Agent Marginal Q-Learning from Demonstrations (MAMQL), a novel sample-efficient framework for multi-agent IRL. For each agent, MAMQL learns a critic marginalized over the other agents' policies, allowing for a well-motivated use of Boltzmann policies in the multi-agent context. We identify a connection between optimal marginalized critics and single-agent soft-Q IRL, allowing us to apply a direct, simple optimization criterion from the single-agent domain. Across our experiments on three different simulated domains, MAMQL significantly outperforms previous multi-agent methods in average reward, sample efficiency, and reward recovery, often by more than 2–5×. We make our code available at https://sites.google.com/view/mamql.
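The two ideas at the core of the abstract, a per-agent critic marginalized over the other agents' policies and a Boltzmann (softmax) policy over the resulting marginal Q-values, can be illustrated with a minimal tabular sketch. This is not the authors' implementation (MAMQL uses learned critic networks); the joint Q-table, the other agent's policy, and the temperature below are all hypothetical values chosen for illustration.

```python
import numpy as np

def marginal_q(joint_q, other_policy):
    """Marginalize a joint critic over the other agent's policy.

    joint_q: array of shape (|A_i|, |A_j|), Q_i(s, a_i, a_j) at a fixed state.
    other_policy: array of shape (|A_j|,), pi_j(a_j | s).
    Returns the marginal Q_i(s, a_i), shape (|A_i|,).
    """
    return joint_q @ other_policy

def boltzmann_policy(q_values, temperature=1.0):
    """Softmax policy with pi(a | s) proportional to exp(Q(s, a) / temperature)."""
    logits = q_values / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy two-agent example: 3 actions each, at a single fixed state.
joint_q = np.array([[1.0, 0.0, 2.0],
                    [0.5, 0.5, 0.5],
                    [2.0, 1.0, 0.0]])          # hypothetical joint critic values
pi_other = np.array([0.2, 0.3, 0.5])           # assumed policy of the other agent
q_i = marginal_q(joint_q, pi_other)            # -> [1.2, 0.5, 0.7]
pi_i = boltzmann_policy(q_i, temperature=0.5)  # valid distribution over actions
```

With the marginal Q-values in hand, the agent's action distribution is exactly a single-agent Boltzmann policy, which is the structural connection that lets MAMQL reuse a soft-Q IRL objective from the single-agent domain.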