🤖 AI Summary
LLM-based multi-agent systems (MAS-LLMs) suffer from a fundamental tension between exploration (divergent search over the solution space) and convergence (synthesis toward an optimal solution), leading to premature consensus, error propagation, and inaccurate credit assignment. This paper proposes Maestro, a framework that orchestrates agent roles to enable parallel, diverse exploration while enforcing centralized, aggregated evaluation. The authors introduce Conditional Listwise Policy Optimization (CLPO), a novel algorithm that decouples reward signals for strategic decision-making from those for tactical reasoning, enabling fine-grained credit assignment and strong contrastive supervision. Maestro combines a multi-agent architecture with policy-gradient methods and a listwise ranking loss. Evaluated on mathematical reasoning and general problem-solving benchmarks, it achieves an average absolute accuracy improvement of 6% (up to 10%), substantially outperforming current state-of-the-art approaches.
📝 Abstract
Multi-agent systems (MAS) built on Large Language Models (LLMs) are increasingly used to tackle complex problems and can surpass single-model inference. However, their success hinges on navigating a fundamental cognitive tension: the need to balance broad, divergent exploration of the solution space with a principled, convergent synthesis toward the optimal solution. Existing paradigms often struggle to manage this duality, leading to premature consensus, error propagation, and a critical credit assignment problem that fails to distinguish genuine reasoning from superficially plausible arguments. To resolve this core challenge, we propose the Multi-Agent Exploration-Synthesis framework Through Role Orchestration (Maestro), a principled collaboration paradigm that structurally decouples these cognitive modes. Maestro uses a collective of parallel Execution Agents for diverse exploration and a specialized Central Agent for convergent, evaluative synthesis. To operationalize this critical synthesis phase, we introduce Conditional Listwise Policy Optimization (CLPO), a reinforcement learning objective that disentangles signals for strategic decisions from those for tactical rationales. By combining decision-focused policy gradients with a listwise ranking loss over justifications, CLPO achieves clean credit assignment and stronger comparative supervision. Experiments on mathematical reasoning and general problem-solving benchmarks demonstrate that Maestro, coupled with CLPO, consistently outperforms existing state-of-the-art multi-agent approaches, delivering absolute accuracy gains of 6% on average and up to 10% at best.
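The abstract does not spell out CLPO's exact objective, but it names its two ingredients: a decision-focused policy-gradient term and a listwise ranking loss over justifications. A minimal, illustrative sketch of that general shape (using a REINFORCE-style decision term and a Plackett-Luce listwise loss; function names, the mixing weight `lam`, and all inputs here are assumptions, not the paper's implementation) might look like:

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def decision_pg_loss(decision_logits, chosen_idx, advantage):
    """REINFORCE-style term for the Central Agent's strategic decision:
    -advantage * log pi(chosen decision)."""
    probs = softmax(decision_logits)
    return -advantage * math.log(probs[chosen_idx])


def listwise_ranking_loss(scores, ranking):
    """Plackett-Luce negative log-likelihood of a target ranking
    (best justification first) given per-justification scores."""
    loss = 0.0
    remaining = list(ranking)
    while remaining:
        logits = [scores[i] for i in remaining]
        probs = softmax(logits)
        loss -= math.log(probs[0])  # prob. that the true-best item ranks first
        remaining = remaining[1:]
    return loss


def clpo_style_loss(decision_logits, chosen_idx, advantage,
                    scores, ranking, lam=1.0):
    """Decoupled objective: decision term + weighted listwise term."""
    return (decision_pg_loss(decision_logits, chosen_idx, advantage)
            + lam * listwise_ranking_loss(scores, ranking))
```

The key structural point this sketch illustrates is the decoupling: the advantage signal only touches the strategic decision term, while the comparative (listwise) signal only touches the justification scores, so credit for "choosing well" and "arguing well" is assigned separately.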