MARS: Reinforcing Multi-Agent Reasoning of LLMs through Self-Play in Strategic Games

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Extending RL-based reasoning from single-agent tasks to multi-turn, multi-agent LLM settings is hampered by long-horizon credit assignment and the lack of agent-specific advantage estimation. To address this, the authors propose MARS, an end-to-end reinforcement learning framework featuring a turn-level advantage estimator and agent-specific advantage normalization. MARS trains via self-play in cooperative and competitive strategic games to improve the accuracy of long-term credit assignment and the generalization of learned policies. Experiments with the Qwen3-4B model show significant gains: up to +28.7% on held-out games, and, when the MARS agent is integrated into leading multi-agent systems, +10.0% on AIME and +12.5% on GPQA-Diamond. These results indicate markedly enhanced multi-agent reasoning that transfers beyond games.

📝 Abstract
Developing Large Language Models (LLMs) to cooperate and compete effectively within multi-agent systems is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARS, an end-to-end RL framework that incentivizes Multi-Agent Reasoning of LLMs through Self-play in both cooperative and competitive games. MARS features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, the MARS agent trained from Qwen3-4B develops strong strategic abilities that generalize to held-out games with up to 28.7% performance improvements. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains of multi-agent systems in reasoning benchmarks. When integrated into leading multi-agent systems, our MARS agent achieves significant performance gains of 10.0% on AIME and 12.5% on GPQA-Diamond. These results establish end-to-end RL training with self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs. Our code and models are publicly available at https://github.com/thu-nics/MARS.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-agent reasoning in LLMs through self-play games
Addressing long-horizon credit assignment in multi-agent systems
Improving strategic generalization across cooperative and competitive scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses self-play in strategic games for training
Features turn-level advantage estimator for credit assignment
Applies agent-specific advantage normalization to stabilize training
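The two RL ingredients above can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: the function name, the simple discounted-return advantage (no learned baseline), and the epsilon term are assumptions; the paper's actual estimator is described only at a high level here.

```python
import statistics

def agent_normalized_advantages(turn_rewards, agent_ids, gamma=1.0):
    """Hypothetical sketch: turn-level advantages from per-turn rewards,
    then normalized separately within each agent's own turns."""
    n = len(turn_rewards)
    # Turn-level returns: discounted sum of rewards from this turn onward,
    # so each turn's learning signal is aligned with that interaction.
    returns = [0.0] * n
    running = 0.0
    for t in reversed(range(n)):
        running = turn_rewards[t] + gamma * running
        returns[t] = running
    # Agent-specific normalization: standardize each agent's advantages
    # using only that agent's turns, so differing reward scales between
    # agents do not destabilize the shared policy update.
    advantages = [0.0] * n
    for aid in set(agent_ids):
        idx = [t for t in range(n) if agent_ids[t] == aid]
        vals = [returns[t] for t in idx]
        mean = statistics.fmean(vals)
        std = statistics.pstdev(vals)  # population std, a common RL choice
        for t in idx:
            advantages[t] = (returns[t] - mean) / (std + 1e-8)
    return advantages
```

For a two-agent episode with alternating turns, each agent's normalized advantages end up zero-mean over its own turns, which is the stabilizing property the paper attributes to agent-specific normalization.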
👥 Authors

Huining Yuan
Tsinghua University
Machine Learning · Reinforcement Learning · World Models

Zelai Xu
PhD Student, Tsinghua University
Language Agent · Reinforcement Learning · Multi-Agent System

Zheyue Tan
Aalto University

Xiangmin Yi
Tsinghua University

Mo Guang
Li Auto Inc.

Kaiwen Long
Li Auto Inc.

Haojia Hui
Li Auto Inc.

Boxun Li
Infinigence-AI

Xinlei Chen
Tsinghua University

Bo Zhao
Aalto University

Xiao-Ping Zhang
Tsinghua University

Chao Yu
Tsinghua University

Yu Wang
Tsinghua University