🤖 AI Summary
To address the limitations of single chain-of-thought reasoning in large language models (LLMs), namely restricted exploration of reasoning paths and weak error correction, this paper proposes MALT, a multi-agent collaborative post-training framework. MALT decomposes reasoning into three sequential roles, generation, verification, and refinement, and organizes the heterogeneous agents' sampled outputs into a structured search tree. A value-iteration mechanism propagates outcome rewards back through the tree, enabling each role-conditioned model to learn autonomously from both positive and negative reasoning trajectories without human annotations or teacher models. Key technical components include role-conditioned modeling, multi-agent tree sampling, off-policy value iteration, and reward backpropagation. Evaluated on MATH, GSM8K, and CSQA, MALT achieves relative accuracy improvements of 15.66%, 7.42%, and 9.40% over the same baseline LLM, respectively, improving robustness and generalization on complex reasoning tasks.
📝 Abstract
Large Language Models (LLMs) often produce answers with a single chain of thought, which restricts their ability to explore alternative reasoning paths or self-correct flawed outputs in complex tasks. In this paper, we introduce MALT (Multi-Agent LLM Training), a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps handled by a sequential pipeline of heterogeneous agents. During data generation, each agent is repeatedly sampled to form a multi-agent search tree, and final outputs are graded against ground-truth data. We then apply value iteration to propagate reward signals back to each role-conditioned model, automatically producing multi-agent post-training data without human or teacher-model supervision. Our off-policy approach lets each agent specialize by learning from both correct and incorrect trajectories, improving the end-to-end reasoning chain. On MATH, GSM8K, and CSQA, MALT surpasses the same baseline LLM with relative improvements of 15.66%, 7.42%, and 9.40%, respectively, marking an important step toward multi-agent cooperative training.
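The tree-sampling and reward-backpropagation loop described in the abstract can be sketched in a few lines. Everything below is illustrative rather than MALT's actual implementation: the `generator`, `verifier`, and `refiner` functions are toy stand-ins for the role-conditioned LLMs (here one is arbitrarily deemed correct so the example is deterministic), and the reward is a binary grade averaged upward through the tree, Monte Carlo style, to assign a value to every intermediate node.

```python
# Hypothetical stand-ins for MALT's three role-conditioned agents. In the
# paper these are LLMs sampled repeatedly; here each one emits a labelled
# string so the tree construction and credit assignment can run end to end.
def generator(question, i):
    return f"gen{i}({question})"

def verifier(gen_output, j):
    return f"ver{j}({gen_output})"

def refiner(ver_output, k):
    return f"ref{k}({ver_output})"

def grade(final_answer, ground_truth):
    # Toy binary outcome reward: pretend only refiner 0's outputs match the
    # ground truth, so the example is deterministic.
    return 1.0 if final_answer.startswith("ref0") else 0.0

def build_tree_and_propagate(question, ground_truth, branch=2):
    """Sample a generator -> verifier -> refiner search tree with branching
    factor `branch`, grade the leaves against ground truth, and propagate
    mean rewards upward so every node gets a value estimate."""
    records = []  # (role, node_text, value) triples usable as training data
    for i in range(branch):
        g = generator(question, i)
        ver_values = []
        for j in range(branch):
            v = verifier(g, j)
            leaf_rewards = []
            for k in range(branch):
                r = refiner(v, k)
                reward = grade(r, ground_truth)
                records.append(("refiner", r, reward))
                leaf_rewards.append(reward)
            # A verifier node's value is the mean reward of its refiner leaves.
            v_val = sum(leaf_rewards) / len(leaf_rewards)
            records.append(("verifier", v, v_val))
            ver_values.append(v_val)
        # A generator node's value is the mean value of its verifier children.
        records.append(("generator", g, sum(ver_values) / len(ver_values)))
    return records

def partition(records, threshold=0.5):
    """Split nodes into positive and negative trajectories per the value
    threshold, e.g. to build per-role preference pairs or SFT targets."""
    pos = [rec for rec in records if rec[2] >= threshold]
    neg = [rec for rec in records if rec[2] < threshold]
    return pos, neg
```

With `branch=2` this produces 8 refiner leaves, 4 verifier nodes, and 2 generator nodes, each carrying a propagated value; the `partition` step reflects how positive and negative trajectories could feed each agent's off-policy training, though the actual loss and data format are the paper's, not this sketch's.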