MARS: Reinforcing Multi-Agent Reasoning of LLMs through Self-Play in Strategic Games

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Extending RL-based reasoning from single-agent tasks to multi-turn, multi-agent LLM settings is hampered by long-horizon credit assignment and the lack of agent-specific advantage estimation. To address this, the authors propose MARS, an end-to-end reinforcement learning framework featuring a turn-level advantage estimator and agent-specific advantage normalization. MARS trains via self-play in cooperative and competitive strategic games to improve the accuracy of long-term credit assignment and the generalization of learned policies. Experiments with the Qwen3-4B model show significant gains: up to +28.7% on held-out games, and, when the MARS agent is integrated into leading multi-agent systems, +10.0% on AIME and +12.5% on GPQA-Diamond. These results indicate markedly enhanced multi-agent reasoning that transfers beyond games.

📝 Abstract
Developing Large Language Models (LLMs) to cooperate and compete effectively within multi-agent systems is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARS, an end-to-end RL framework that incentivizes Multi-Agent Reasoning of LLMs through Self-play in both cooperative and competitive games. MARS features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, the MARS agent trained from Qwen3-4B develops strong strategic abilities that generalize to held-out games with up to 28.7% performance improvements. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains of multi-agent systems in reasoning benchmarks. When integrated into leading multi-agent systems, our MARS agent achieves significant performance gains of 10.0% on AIME and 12.5% on GPQA-Diamond. These results establish end-to-end RL training with self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs. Our code and models are publicly available at https://github.com/thu-nics/MARS.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-agent reasoning in LLMs through self-play games
Addressing long-horizon credit assignment in multi-agent systems
Improving strategic generalization across cooperative and competitive scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses self-play in strategic games for training
Features turn-level advantage estimator for credit assignment
Applies agent-specific advantage normalization to stabilize training
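The two RL ingredients above can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: the function name, the simple discounted-return advantage (no learned baseline), and the epsilon term are assumptions; the paper's actual estimator is described only at a high level here.

```python
import statistics

def agent_normalized_advantages(turn_rewards, agent_ids, gamma=1.0):
    """Hypothetical sketch: turn-level advantages from per-turn rewards,
    then normalized separately within each agent's own turns."""
    n = len(turn_rewards)
    # Turn-level returns: discounted sum of rewards from this turn onward,
    # so each turn's learning signal is aligned with that interaction.
    returns = [0.0] * n
    running = 0.0
    for t in reversed(range(n)):
        running = turn_rewards[t] + gamma * running
        returns[t] = running
    # Agent-specific normalization: standardize each agent's advantages
    # using only that agent's turns, so differing reward scales between
    # agents do not destabilize the shared policy update.
    advantages = [0.0] * n
    for aid in set(agent_ids):
        idx = [t for t in range(n) if agent_ids[t] == aid]
        vals = [returns[t] for t in idx]
        mean = statistics.fmean(vals)
        std = statistics.pstdev(vals)  # population std, a common RL choice
        for t in idx:
            advantages[t] = (returns[t] - mean) / (std + 1e-8)
    return advantages
```

For a two-agent episode with alternating turns, each agent's normalized advantages end up zero-mean over its own turns, which is the stabilizing property the paper attributes to agent-specific normalization.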
👥 Authors

Huining Yuan
Tsinghua University
Machine Learning · Reinforcement Learning · World Models

Zelai Xu
PhD Student, Tsinghua University
Language Agent · Reinforcement Learning · Multi-Agent System

Zheyue Tan
Aalto University

Xiangmin Yi
Tsinghua University

Mo Guang
Li Auto Inc.

Kaiwen Long
Li Auto Inc.

Haojia Hui
Li Auto Inc.

Boxun Li
Infinigence-AI

Xinlei Chen
Tsinghua University

Bo Zhao
Aalto University

Xiao-Ping Zhang
Tsinghua University

Chao Yu
Tsinghua University

Yu Wang
Tsinghua University