π€ AI Summary
Multi-agent systems (MAS) face challenges in inefficient computational resource allocation and suboptimal collaboration quality during reasoning. This paper introduces MASPRMβthe first process reward model for MAS that requires no step-level human annotations. MASPRM performs backward propagation of scalar return signals to enable fine-grained, per-step, per-agent local value estimation. It supports zero-shot cross-task transfer and integrates seamlessly with beam search and Monte Carlo Tree Search (MCTS), dynamically guiding search trajectories. On GSM8K and MATH, MASPRM achieves +30.7 and +22.9 absolute improvements in exact match (EM) over single-pass decoding, respectively; moreover, a GSM8K-trained model transfers zero-shot to MATH, yielding an +8.4 EM gain. Key contributions include: (1) the first fine-grained process modeling framework for multi-agent reasoning, (2) a fully automated, annotation-free training paradigm, and (3) an efficient, runtime resource allocation mechanism that directs computation toward high-value agents and steps.
π Abstract
Practical deployment of Multi-Agent Systems (MAS) demands strong test-time performance, motivating methods that guide inference-time search and selectively spend compute to improve quality. We present the Multi-Agent System Process Reward Model (MASPRM). It assigns per-action, per-agent values to partial inter-agent transcripts and acts as an inference-time controller. MASPRM is trained from multi-agent Monte Carlo Tree Search (MCTS) rollouts without requiring step-level human annotations, by propagating returns to local targets. At inference, MASPRM guides step-level beam search and MCTS, focusing computation on promising branches and pruning early. On GSM8K and MATH, MASPRM-guided decoding with an outcome reward model (ORM) applied to the final answer, improves exact match (EM) over a single straight-through MAS pass by $+30.7$ and $+22.9$ points, respectively. A MASPRM trained on GSM8K transfers zero-shot to MATH without retraining, adding $8.4$ EM points at the same budget. MASPRM is a plug-in value model that estimates per-agent progress and complements verifier-style decoders, enabling more reliable, compute-aware multi-agent reasoning. Code: https://github.com/milad1378yz/MASPRM