🤖 AI Summary
To address the high computational cost and training complexity of large language model (LLM)-based multi-agent frameworks, this paper proposes a hierarchical multi-agent architecture in which only a single leader LLM is trained to orchestrate a team of frozen, untrained peer agents. The core contribution is Multi-agent guided Leader Policy Optimization (MLPO), an end-to-end training method that combines black-box policy gradients with response synthesis and requires no auxiliary networks, explicit agent feedback, or external supervision signals. The trained leader can also be deployed standalone and exhibits strong generalization. Empirical evaluation on the BBH, MATH, and MMLU benchmarks demonstrates substantial improvements over both single-agent baselines and state-of-the-art multi-agent approaches, achieving superior performance on complex reasoning tasks while reducing inference latency and training overhead. These results validate the method's effectiveness, efficiency, and scalability.
📝 Abstract
Large Language Models (LLMs) have achieved strong performance on a wide range of complex reasoning tasks, yet further gains are often possible by leveraging the complementary strengths of multiple models. While multi-agent frameworks can improve solution quality by combining multiple LLMs, existing methods are often computationally expensive at both training and inference time. In this work, we introduce a hierarchical multi-agent framework that addresses these challenges by training only a single leader LLM to coordinate a team of untrained peer agents. To this end, we propose Multi-agent guided Leader Policy Optimization (MLPO), a novel approach that trains the leader to evaluate and synthesize agent responses without auxiliary value networks or explicit agent feedback. Leaders trained with MLPO perform better not only when interacting with the agent team at inference time, but also when deployed in single-agent settings without the team. Empirical results on Big-Bench Hard (BBH), MATH, and MMLU demonstrate that our framework achieves substantial performance improvements over both single-agent and multi-agent baselines. Our results highlight the effectiveness and efficiency of training a single, flexible leader for collaborative reasoning in multi-agent LLM systems.
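The abstract gives no implementation details, but the core idea, training only a leader policy with a black-box policy gradient while the peer agents stay frozen, can be illustrated with a toy sketch. Here the "peers" are hypothetical stand-in functions producing candidate answers, and the "leader" is a tiny softmax policy updated REINFORCE-style to pick which candidate to adopt. All names and the arithmetic task are invented for illustration; this is not the paper's actual MLPO algorithm.

```python
import math
import random

random.seed(0)

# Hypothetical stand-ins for untrained peer LLMs: each maps a question
# (a, b) to a candidate answer with a fixed amount of noise. The peers
# are never updated, mirroring the "frozen peer agents" idea.
def make_peer(noise):
    def peer(a, b):
        return a + b + random.choice([-noise, 0, noise])
    return peer

peers = [make_peer(0), make_peer(1), make_peer(2)]

# Toy "leader": a softmax policy over which peer's candidate to adopt
# as the synthesized answer. Only these weights are trained.
weights = [0.0, 0.0, 0.0]

def softmax(ws):
    m = max(ws)
    exps = [math.exp(w - m) for w in ws]
    total = sum(exps)
    return [e / total for e in exps]

lr = 0.5
for step in range(300):
    a, b = random.randint(0, 9), random.randint(0, 9)
    candidates = [p(a, b) for p in peers]
    probs = softmax(weights)
    # Sample which candidate the leader adopts.
    r = random.random()
    choice = len(probs) - 1
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            choice = i
            break
    # Black-box reward: 1 if the synthesized answer is correct.
    reward = 1.0 if candidates[choice] == a + b else 0.0
    # REINFORCE-style policy gradient on the leader alone:
    # grad log pi(choice) w.r.t. weight i is 1{i == choice} - probs[i].
    for i in range(len(weights)):
        grad = (1.0 if i == choice else 0.0) - probs[i]
        weights[i] += lr * reward * grad

best = max(range(len(weights)), key=lambda i: weights[i])
print(best, [round(w, 2) for w in weights])
```

Under these toy assumptions the leader learns to favor the noise-free peer, since only its candidates are consistently rewarded; the real method instead trains an LLM leader to evaluate and synthesize free-form agent responses.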