🤖 AI Summary
In multi-agent social dilemmas, standard reinforcement learning often drives large language model (LLM) agents toward defection, undermining collective welfare. To address this, we propose a cooperative multi-agent training framework tailored for LLMs. First, we design Trust and Split, a novel benchmark environment that requires natural-language negotiation. Second, we introduce a group-relative baseline that simplifies advantage estimation, and we adapt Advantage Alignment, an opponent-learning awareness algorithm, to fine-tune LLM agents toward cooperation. Our method improves collective payoff across multiple social dilemma tasks (+23.6% on average), yielding policies with high cooperation rates (>85%) that resist exploitation by greedy adversarial strategies. Moreover, the learned policies generalize to state-of-the-art closed-source LLMs.
📝 Abstract
As agentic AI becomes more widespread, agents with distinct and possibly conflicting goals will interact in complex ways. These multi-agent interactions pose a fundamental challenge, particularly in social dilemmas, where agents' individual incentives can undermine collective welfare. While reinforcement learning (RL) has been effective for aligning large language models (LLMs) in the single-agent regime, prior small-network results suggest that standard RL in multi-agent settings often converges to defecting, self-interested policies. We show the same effect in LLMs: despite cooperative priors, RL-trained LLM agents develop opportunistic behavior that can exploit even advanced closed-source models. To address this tendency of RL to converge to poor equilibria, we adapt a recent opponent-learning awareness algorithm, Advantage Alignment, to fine-tune LLMs toward multi-agent cooperation and non-exploitability. We then introduce a group-relative baseline that simplifies advantage computation in iterated games, enabling multi-agent training at LLM scale. We also contribute a novel social dilemma environment, Trust and Split, which requires natural language communication to achieve high collective welfare. Across a wide range of social dilemmas, policies learned with Advantage Alignment achieve higher collective payoffs while remaining robust against exploitation by greedy agents.
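To make the notion of a social dilemma concrete, the sketch below uses a one-shot Prisoner's Dilemma. It is an illustration only: the payoff values and the helper names (`PAYOFFS`, `best_response`, `collective_payoff`) are assumptions chosen for this example, not taken from the paper or from its Trust and Split environment. It shows how each agent's individually best action can differ from the joint action that maximizes collective welfare.

```python
# Illustrative one-shot Prisoner's Dilemma.
# Payoff values are assumed for this example; they do not come from the paper.
C, D = "cooperate", "defect"

# PAYOFFS[(row_action, col_action)] = (row player's payoff, column player's payoff)
PAYOFFS = {
    (C, C): (3, 3),
    (C, D): (0, 5),
    (D, C): (5, 0),
    (D, D): (1, 1),
}


def best_response(opponent_action):
    """Row player's payoff-maximizing action against a fixed opponent action."""
    return max((C, D), key=lambda a: PAYOFFS[(a, opponent_action)][0])


def collective_payoff(row_action, col_action):
    """Sum of both players' payoffs for a joint action."""
    return sum(PAYOFFS[(row_action, col_action)])


# Set of best responses over all opponent actions; a single element means
# that action is the best reply no matter what the opponent does.
dominant = {best_response(C), best_response(D)}

# Joint action with the highest total payoff.
welfare = {pair: collective_payoff(*pair) for pair in PAYOFFS}
best_joint = max(welfare, key=welfare.get)

print("individually best action(s):", dominant)
print("welfare-maximizing joint action:", best_joint)
```

With these assumed payoffs, defecting is the best reply to either opponent action, yet mutual cooperation gives the highest total payoff. A learner that optimizes only its own return is pulled toward the first outcome, which is the tension the abstract describes.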