🤖 AI Summary
In multi-agent games, opponent modeling often relies on domain-specific heuristics, and computing exact best responses is intractable in large-scale imperfect-information settings.
Method: This paper extends the Policy-Space Response Oracles (PSRO) framework by integrating Monte Carlo Tree Search (MCTS) with conditional generative sampling of world states, plus meta-strategy computation driven by the Nash Bargaining Solution (NBS). It is among the first works to bring generative modeling into MARL search, introduces two novel NBS-based meta-strategy solvers, and enables online Bayesian prediction of co-player policies.
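To make the NBS concrete: over a set of feasible joint payoffs, the Nash bargaining solution selects the outcome that maximizes the product of each player's gain over their disagreement payoff. The sketch below is a minimal illustration on hypothetical deal values, not the paper's actual meta-strategy solvers, which apply this criterion over policy populations.

```python
# Illustrative Nash Bargaining Solution on a toy negotiation.
# All deal payoffs and disagreement values below are hypothetical.

def nash_bargaining(outcomes, disagreement):
    """Return the joint payoff maximizing the Nash product
    (u1 - d1) * (u2 - d2) over individually rational outcomes."""
    d1, d2 = disagreement
    # Keep only deals each player weakly prefers to disagreement.
    feasible = [(u1, u2) for u1, u2 in outcomes if u1 >= d1 and u2 >= d2]
    return max(feasible, key=lambda u: (u[0] - d1) * (u[1] - d2))

# Hypothetical deals in a Deal-or-No-Deal-style split of items.
deals = [(9, 1), (7, 5), (4, 7), (1, 9)]
print(nash_bargaining(deals, disagreement=(1, 1)))  # (7, 5): gain product 24
```

Note how the NBS favors the balanced deal (7, 5) over lopsided splits, even though (9, 1) gives player 1 more: the product of gains rewards outcomes where both players benefit.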
Results: Evaluated on Colored Trails and Deal or No Deal, two canonical imperfect-information negotiation benchmarks, the algorithm converges toward approximate Nash equilibria. In human-subject experiments with 346 participants, it achieves social welfare comparable to human–human negotiation while improving the efficiency and fairness of human–AI deals.
📝 Abstract
Multiagent reinforcement learning (MARL) has benefited significantly from population-based and game-theoretic training regimes. One approach, Policy-Space Response Oracles (PSRO), employs standard reinforcement learning to compute response policies via approximate best responses and combines them via meta-strategy selection. We augment PSRO by adding a novel search procedure with generative sampling of world states, and introduce two new meta-strategy solvers based on the Nash bargaining solution. We evaluate PSRO's ability to compute approximate Nash equilibrium, and its performance in two negotiation games: Colored Trails, and Deal or No Deal. We conduct behavioral studies where human participants negotiate with our agents ($N = 346$). We find that search with generative modeling finds stronger policies during both training time and test time, enables online Bayesian co-player prediction, and can produce agents that achieve comparable social welfare negotiating with humans as humans trading among themselves.
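The "online Bayesian co-player prediction" the abstract mentions boils down to maintaining a posterior over a population of candidate co-player policies and updating it after each observed action. A minimal sketch, with hypothetical placeholder policies and action names (the paper's policies come from PSRO training):

```python
# Minimal sketch of online Bayesian co-player prediction:
# P(pi | action) ∝ P(action | pi, state) * P(pi).
# Policies, states, and actions here are hypothetical.

def update_posterior(prior, policies, state, action):
    """One Bayes update of the belief over candidate co-player policies."""
    likelihoods = [pi[state].get(action, 0.0) for pi in policies]
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [w / z for w in unnorm] if z > 0 else prior

# Two hypothetical co-player policies over actions in state "offer".
greedy = {"offer": {"demand_high": 0.9, "split_even": 0.1}}
fair   = {"offer": {"demand_high": 0.2, "split_even": 0.8}}

posterior = [0.5, 0.5]
for observed in ["split_even", "split_even"]:
    posterior = update_posterior(posterior, [greedy, fair], "offer", observed)
print(posterior)  # belief mass shifts strongly toward the "fair" policy
```

During search, such a posterior lets the agent weight MCTS rollouts by which co-player policy it most likely faces, which is how online prediction and search interact in this framework.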