MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Off-the-shelf large language models (LLMs) exhibit weak collaborative capabilities and generalize poorly across tasks and domains when merely prompted to cooperate. Method: This paper proposes MAPoRL (Multi-Agent Post-co-training for collaborative LLMs with Reinforcement Learning), a post-training paradigm built on a multi-round loop of independent generation → multi-turn collaborative discussion → verifier-scored reinforcement, to explicitly elicit collaborative behavior among multiple models. Its core innovation is a multi-agent joint post-training framework: collaborative dialogue is treated as part of the rewardable policy, and a verifier assigns a score that checks answer correctness while incentivizing corrective and persuasive discussion; this score serves as the co-training reward maximized via multi-agent RL, overcoming the limitations of single-model fine-tuning. Contribution/Results: By combining multi-agent reinforcement learning, collaborative dialogue modeling, and joint policy optimization, MAPoRL significantly outperforms single-model fine-tuning on multiple collaborative benchmarks and generalizes to unseen tasks and domains.

📝 Abstract
Leveraging multiple large language models (LLMs) to build collaborative multi-agentic workflows has demonstrated significant potential. However, most previous studies focus on prompting the out-of-the-box LLMs, relying on their innate capability for collaboration, which may not improve LLMs' performance as shown recently. In this paper, we introduce a new post-training paradigm MAPoRL (Multi-Agent Post-co-training for collaborative LLMs with Reinforcement Learning), to explicitly elicit the collaborative behaviors and further unleash the power of multi-agentic LLM frameworks. In MAPoRL, multiple LLMs first generate their own responses independently and engage in a multi-turn discussion to collaboratively improve the final answer. In the end, a MAPoRL verifier evaluates both the answer and the discussion, by assigning a score that verifies the correctness of the answer, while adding incentives to encourage corrective and persuasive discussions. The score serves as the co-training reward, and is then maximized through multi-agent RL. Unlike existing LLM post-training paradigms, MAPoRL advocates the co-training of multiple LLMs together using RL for better generalization. Accompanied by analytical insights, our experiments demonstrate that training individual LLMs alone is insufficient to induce effective collaboration. In contrast, multi-agent co-training can boost the collaboration performance across benchmarks, with generalization to unseen domains.
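The pipeline described in the abstract (independent generation, multi-turn discussion, verifier-scored co-training reward) can be sketched as a toy simulation. Everything below is illustrative, not the paper's implementation: the agents are stub policies, `toy_verifier` is a hypothetical stand-in for MAPoRL's learned verifier, and no RL update is performed; the returned score is simply the scalar that multi-agent RL would maximize.

```python
def mapo_rl_episode(agents, question, verifier, n_turns=2):
    """One collaborative episode, MAPoRL-style:
    1) each agent answers independently,
    2) agents revise over several discussion turns, seeing peers' answers,
    3) a verifier scores the final answer and the discussion transcript."""
    transcript = []
    # 1. Independent generation (no shared context yet)
    answers = [agent(question, context=[]) for agent in agents]
    transcript.append(answers)
    # 2. Multi-turn discussion: each agent conditions on the previous turn
    for _ in range(n_turns):
        answers = [agent(question, context=transcript[-1]) for agent in agents]
        transcript.append(answers)
    # 3. Verifier assigns the co-training reward
    final_answer = answers[0]  # e.g. report one agent's final answer
    reward = verifier(question, final_answer, transcript)
    return final_answer, reward

def make_agent(initial_guess):
    """Toy policy: start from a fixed guess, then adopt the majority
    answer observed in the previous discussion turn."""
    def agent(question, context):
        if not context:
            return initial_guess
        return max(set(context), key=context.count)
    return agent

def toy_verifier(question, answer, transcript):
    """Hypothetical verifier: reward answer correctness, plus a small
    bonus if the discussion converged (a crude 'persuasion' incentive)."""
    correct = 1.0 if answer == "4" else 0.0
    bonus = 0.5 if len(set(transcript[-1])) == 1 else 0.0
    return correct + bonus

agents = [make_agent("4"), make_agent("4"), make_agent("5")]
final, reward = mapo_rl_episode(agents, "What is 2+2?", toy_verifier)
# The initially wrong agent is persuaded by the majority during discussion,
# so the group converges on "4" and earns the convergence bonus.
```

In the actual method, the agents are LLM policies updated by multi-agent RL against this reward, and the verifier is itself trained rather than hand-coded.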
Problem

Research questions and friction points this paper is trying to address.

Prompting out-of-the-box LLMs yields weak, unreliable collaboration in multi-agent workflows
Single-model fine-tuning does not induce effective collaborative behavior
Learned collaboration should generalize to unseen tasks and domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Reinforcement Learning
Collaborative LLM Post-Training
Verification-Incentivized Discussion