Maestro: Learning to Collaborate via Conditional Listwise Policy Optimization for Multi-Agent LLMs

📅 2025-11-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-agent systems built on large language models (LLM-based MAS) face a fundamental tension between exploration (divergent search over the solution space) and convergence (synthesis toward an optimal solution), which leads to premature consensus, error propagation, and inaccurate credit assignment. This paper proposes Maestro, a framework that orchestrates agent roles to enable parallel, diverse exploration while enforcing centralized, aggregated evaluation. It introduces Conditional Listwise Policy Optimization (CLPO), a novel algorithm that decouples the reward signal for strategic decision-making from the signal for tactical reasoning, enabling fine-grained credit assignment and strong contrastive supervision. Maestro integrates a multi-agent architecture, policy-gradient methods, and a listwise ranking loss. Evaluated on mathematical reasoning and general problem-solving benchmarks, it improves accuracy by 6% on average (up to 10%), substantially outperforming current state-of-the-art approaches.

📝 Abstract
Multi-agent systems (MAS) built on Large Language Models (LLMs) are being used to approach complex problems and can surpass single model inference. However, their success hinges on navigating a fundamental cognitive tension: the need to balance broad, divergent exploration of the solution space with a principled, convergent synthesis to the optimal solution. Existing paradigms often struggle to manage this duality, leading to premature consensus, error propagation, and a critical credit assignment problem that fails to distinguish between genuine reasoning and superficially plausible arguments. To resolve this core challenge, we propose the Multi-Agent Exploration-Synthesis framework Through Role Orchestration (Maestro), a principled paradigm for collaboration that structurally decouples these cognitive modes. Maestro uses a collective of parallel Execution Agents for diverse exploration and a specialized Central Agent for convergent, evaluative synthesis. To operationalize this critical synthesis phase, we introduce Conditional Listwise Policy Optimization (CLPO), a reinforcement learning objective that disentangles signals for strategic decisions and tactical rationales. By combining decision-focused policy gradients with a list-wise ranking loss over justifications, CLPO achieves clean credit assignment and stronger comparative supervision. Experiments on mathematical reasoning and general problem-solving benchmarks demonstrate that Maestro, coupled with CLPO, consistently outperforms existing state-of-the-art multi-agent approaches, delivering absolute accuracy gains of 6% on average and up to 10% at best.
Problem

Research questions and friction points this paper is trying to address.

Balancing exploration and synthesis in multi-agent LLM systems
Resolving premature consensus and error propagation issues
Addressing credit assignment between strategic and tactical decisions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples exploration and synthesis via role orchestration
Uses Conditional Listwise Policy Optimization for credit assignment
Combines policy gradients with list-wise ranking loss
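The combination above can be sketched as a two-term objective: a decision-focused policy-gradient term for the Central Agent's strategic choice, plus a listwise ranking term over candidate justifications. The sketch below is a minimal numpy illustration of that structure, not the paper's exact formulation; the ListMLE-style (Plackett-Luce) ranking term, the function names, and the weighting `lam` are assumptions for clarity.

```python
import numpy as np

def policy_gradient_loss(logp_decision, advantage):
    # REINFORCE-style term for the strategic decision: -A * log pi(a | s).
    # Only the decision's own advantage drives this term, keeping credit
    # assignment for "which option to pick" separate from rationale quality.
    return -advantage * logp_decision

def listmle_loss(scores, ranking):
    # Plackett-Luce negative log-likelihood of a target ranking of
    # justifications (`ranking` lists candidate indices from best to worst).
    # Lower loss means the scores agree with the target ordering.
    s = scores[np.asarray(ranking)]
    loss = 0.0
    for i in range(len(s)):
        loss += np.log(np.sum(np.exp(s[i:]))) - s[i]
    return loss

def clpo_style_loss(logp_decision, advantage, rationale_scores, ranking, lam=1.0):
    # Hypothetical combined objective: a decoupled decision term plus a
    # listwise rationale term, weighted by an illustrative coefficient `lam`.
    return (policy_gradient_loss(logp_decision, advantage)
            + lam * listmle_loss(rationale_scores, ranking))
```

A correctly ordered list of justifications should score a lower ranking loss than the reversed order, which is the comparative supervision signal the summary describes.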