MASPRM: Multi-Agent System Process Reward Model

📅 2025-10-27

🤖 AI Summary

Multi-agent systems (MAS) often allocate compute inefficiently and collaborate suboptimally during reasoning. This paper introduces MASPRM, the first process reward model for MAS that requires no step-level human annotations. MASPRM propagates scalar returns from search rollouts back to local targets, which yields fine-grained, per-step, per-agent value estimates. It supports zero-shot cross-task transfer and plugs into step-level beam search and Monte Carlo Tree Search (MCTS), where it guides the search at inference time. On GSM8K and MATH, MASPRM-guided decoding with an outcome reward model (ORM) on the final answer improves exact match (EM) over a single straight-through MAS pass by +30.7 and +22.9 points, respectively; a MASPRM trained on GSM8K also transfers zero-shot to MATH, adding 8.4 EM points at the same budget. Key contributions include: (1) the first fine-grained process modeling framework for multi-agent reasoning, (2) a fully automated, annotation-free training paradigm, and (3) an efficient runtime resource allocation mechanism that directs computation toward high-value agents and steps.

📝 Abstract

Practical deployment of Multi-Agent Systems (MAS) demands strong test-time performance, motivating methods that guide inference-time search and selectively spend compute to improve quality. We present the Multi-Agent System Process Reward Model (MASPRM). It assigns per-action, per-agent values to partial inter-agent transcripts and acts as an inference-time controller. MASPRM is trained from multi-agent Monte Carlo Tree Search (MCTS) rollouts by propagating returns to local targets, without requiring step-level human annotations. At inference, MASPRM guides step-level beam search and MCTS, focusing computation on promising branches and pruning unpromising ones early. On GSM8K and MATH, MASPRM-guided decoding, with an outcome reward model (ORM) applied to the final answer, improves exact match (EM) over a single straight-through MAS pass by $+30.7$ and $+22.9$ points, respectively. A MASPRM trained on GSM8K transfers zero-shot to MATH without retraining, adding $8.4$ EM points at the same budget. MASPRM is a plug-in value model that estimates per-agent progress and complements verifier-style decoders, enabling more reliable, compute-aware multi-agent reasoning. Code: https://github.com/milad1378yz/MASPRM
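The idea of "propagating returns to local targets" can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Node` structure, the +1/-1 leaf returns, and the choice of the mean return as the target are all assumptions made for illustration. Each node stands for a partial inter-agent transcript, and its training target is the average return of the rollouts that pass through it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    """One partial inter-agent transcript in a search tree (hypothetical structure)."""
    agent: str                                # agent that produced the last action
    action: str                               # text of that action
    children: list[Node] = field(default_factory=list)
    terminal_return: Optional[float] = None   # set only on leaves (assumed +1 / -1)
    visits: int = 0                           # number of leaf returns below this node
    value_sum: float = 0.0                    # sum of those leaf returns


def backup(node: Node) -> tuple[float, int]:
    """Propagate leaf returns upward; return (sum of returns, leaf count) under node."""
    if node.terminal_return is not None:
        node.value_sum, node.visits = node.terminal_return, 1
    else:
        node.value_sum, node.visits = 0.0, 0
        for child in node.children:
            child_sum, child_visits = backup(child)
            node.value_sum += child_sum
            node.visits += child_visits
    return node.value_sum, node.visits


def training_pairs(node: Node, prefix: tuple = ()) -> Iterator[tuple[tuple, float]]:
    """Yield (transcript, target) pairs; the target is the mean return through the node."""
    transcript = prefix + ((node.agent, node.action),)
    if node.visits > 0:
        yield transcript, node.value_sum / node.visits
    for child in node.children:
        yield from training_pairs(child, transcript)


# Toy tree: one correct branch, and one branch that succeeds half the time.
root = Node("planner", "plan", children=[
    Node("solver", "x = 4", terminal_return=1.0),
    Node("solver", "x = 5", children=[
        Node("verifier", "accept", terminal_return=-1.0),
        Node("verifier", "retry", terminal_return=1.0),
    ]),
])
backup(root)
pairs = dict(training_pairs(root))
```

In this toy tree no step is labeled by hand: every per-step, per-agent target is derived from the final outcomes alone, which is the property the abstract describes.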

Problem

Research questions and friction points this paper is trying to address.

Improving the inference-time performance of multi-agent systems through guided search

Training reward models without step-level human annotations

Enabling compute-aware reasoning with plug-in value models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Assigns per-action, per-agent values to partial inter-agent transcripts

Trained from MCTS rollouts without human annotations

Guides step-level beam search and MCTS to focus computation on promising branches
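How a value model can steer step-level beam search is sketched below in Python. The function `step_beam_search` and its parameters (`expand`, `value`, `is_terminal`, `beam_width`, `max_steps`) are hypothetical names chosen for this example, and the toy arithmetic task stands in for a real multi-agent transcript; a trained MASPRM would take the place of `value`.

```python
from typing import Callable, Sequence

State = tuple[str, ...]  # a partial transcript, stored as a tuple of step strings


def step_beam_search(
    initial: State,
    expand: Callable[[State], Sequence[str]],   # proposes candidate next steps
    value: Callable[[State], float],            # scores a partial transcript
    is_terminal: Callable[[State], bool],
    beam_width: int = 2,
    max_steps: int = 4,
) -> State:
    """Keep the beam_width highest-valued partial transcripts after every step."""
    beam = [initial]
    for _ in range(max_steps):
        candidates: list[State] = []
        for state in beam:
            if is_terminal(state):
                candidates.append(state)  # carry finished transcripts forward
                continue
            candidates.extend(state + (step,) for step in expand(state))
        if not candidates:
            break
        # Low-value branches are pruned here, so no further compute is spent on them.
        beam = sorted(candidates, key=value, reverse=True)[:beam_width]
        if all(is_terminal(s) for s in beam):
            break
    return max(beam, key=value)


# Toy task: choose three digits from {1, 2, 3} whose sum is 7.
TARGET = 7


def toy_value(state: State) -> float:
    return -abs(TARGET - sum(int(step) for step in state))


best = step_beam_search(
    initial=(),
    expand=lambda state: ["1", "2", "3"],
    value=toy_value,
    is_terminal=lambda state: len(state) == 3,
)
```

The design point is that scoring happens after every step rather than only on the final answer, so the search can drop weak partial transcripts before they consume more of the budget.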
