Learning Robust Social Strategies with Large Language Models

📅 2025-11-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In multi-agent social dilemmas, standard reinforcement learning often drives large language model (LLM) agents toward defection, undermining collective welfare. To address this, we present a multi-agent training approach for LLMs with three parts: first, Trust and Split, a novel social dilemma environment that requires natural-language communication; second, a group-relative baseline that simplifies advantage estimation in iterated games; third, an adaptation of Advantage Alignment, an opponent-learning awareness algorithm, for fine-tuning LLMs toward cooperation and non-exploitability. Our method improves collective payoff across multiple social dilemma tasks (average +23.6%), yielding policies with high cooperation rates (>85%) that resist exploitation by greedy agents. The learned policies also generalize to state-of-the-art closed-source LLMs.

📝 Abstract

As agentic AI becomes more widespread, agents with distinct and possibly conflicting goals will interact in complex ways. These multi-agent interactions pose a fundamental challenge, particularly in social dilemmas, where agents' individual incentives can undermine collective welfare. While reinforcement learning (RL) has been effective for aligning large language models (LLMs) in the single-agent regime, prior small-network results suggest that standard RL in multi-agent settings often converges to defecting, self-interested policies. We show the same effect in LLMs: despite cooperative priors, RL-trained LLM agents develop opportunistic behavior that can exploit even advanced closed-source models. To address this tendency of RL to converge to poor equilibria, we adapt a recent opponent-learning awareness algorithm, Advantage Alignment, to fine-tune LLMs toward multi-agent cooperation and non-exploitability. We then introduce a group-relative baseline that simplifies advantage computation in iterated games, enabling multi-agent training at LLM scale. We also contribute a novel social dilemma environment, Trust and Split, which requires natural language communication to achieve high collective welfare. Across a wide range of social dilemmas, policies learned with Advantage Alignment achieve higher collective payoffs while remaining robust against exploitation by greedy agents.

Problem

Research questions and friction points this paper is trying to address.

Multi-agent interactions give rise to social dilemmas, where agents' individual incentives undermine collective welfare

Standard reinforcement learning in multi-agent settings often converges to defecting, self-interested policies

RL-trained LLM agents develop opportunistic behavior that can exploit even advanced closed-source models
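
To make the tension in these points concrete, here is a minimal sketch of a one-shot Prisoner's Dilemma. The payoff values (3, 0, 5, 1) are the common textbook choice and are an assumption for illustration; they are not taken from the paper, and the helper names are hypothetical. The sketch shows that defection is each agent's best response whatever the other does, while mutual cooperation gives the higher combined payoff.

```python
# Illustrative payoffs (assumed, not from the paper).
# Keys are (my_action, opponent_action); values are (my_payoff, opponent_payoff).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def best_response(opponent_action):
    """Return the action that maximizes my own payoff against a fixed opponent action."""
    return max("CD", key=lambda action: PAYOFFS[(action, opponent_action)][0])


def welfare(action_a, action_b):
    """Collective welfare: the sum of both agents' payoffs."""
    return sum(PAYOFFS[(action_a, action_b)])


# Defection is the best response to either opponent action...
dominant = {opp: best_response(opp) for opp in "CD"}

# ...yet mutual cooperation yields more total payoff than mutual defection.
coop_welfare = welfare("C", "C")
defect_welfare = welfare("D", "D")
```

A learner that optimizes only its own payoff is pulled toward `"D"` in both cases, which is the kind of self-interested convergence the points above describe.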

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapted Advantage Alignment algorithm for multi-agent cooperation

Introduced group-relative baseline for simplified advantage computation

Created Trust and Split environment requiring natural language communication
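
The page does not give the formula for the group-relative baseline, so the following is only a sketch of one plausible reading: sample a group of rollouts of the same iterated game, compute each rollout's discounted return-to-go, and use the group mean at each timestep as the baseline. The function names, the discount value, and the equal-length assumption are all hypothetical and may differ from the paper's actual definition.

```python
def returns_to_go(rewards, gamma=0.96):
    """Discounted return from each timestep to the end of one rollout."""
    out, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        out.append(running)
    return out[::-1]


def group_relative_advantages(group_rewards, gamma=0.96):
    """Advantage = return-to-go minus the group's mean return-to-go at that timestep.

    group_rewards: list of per-step reward lists, one per rollout.
    Assumption: all rollouts in the group have the same length.
    """
    rtg = [returns_to_go(rewards, gamma) for rewards in group_rewards]
    n_rollouts = len(rtg)
    horizon = len(rtg[0])
    # Per-timestep baseline: the mean over the group, no learned value network.
    baseline = [sum(traj[t] for traj in rtg) / n_rollouts for t in range(horizon)]
    return [[traj[t] - baseline[t] for t in range(horizon)] for traj in rtg]


# Two rollouts of a two-round game, undiscounted for easy checking.
advantages = group_relative_advantages([[1.0, 1.0], [3.0, 3.0]], gamma=1.0)
```

Under this reading the baseline needs no separate critic, which is one way such a baseline could simplify advantage computation at LLM scale; the advantages within a group sum to zero at every timestep.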
