LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches struggle to jointly optimize role design, capacity assignment, and dependency construction in multi-agent systems, and the execution-level feedback they rely on gives limited credit assignment for individual orchestration decisions. This work proposes LEMON, an LLM-based orchestrator that generates an executable multi-agent orchestration specification. Training augments an orchestration-level GRPO objective with a localized counterfactual signal: a role, capacity, or dependency field is edited, and the resulting reward contrast is applied only to the edited spans. On six reasoning and coding benchmarks (MMLU, GSM8K, AQuA, MultiArith, SVAMP, and HumanEval), LEMON achieves state-of-the-art performance among the evaluated multi-agent orchestration methods.
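To make the idea of an executable specification concrete, here is a minimal Python sketch of what such a structure might look like. It is an illustration only, not LEMON's actual format: the class `AgentSpec`, its field names, the capacity labels, and the helper `execution_order` are all hypothetical, chosen to mirror the role, capacity, and dependency fields mentioned above.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class AgentSpec:
    """One agent in a hypothetical orchestration specification."""
    role: str                         # task-specific role name
    duty: str                         # customized instruction for this role
    capacity: str                     # assumed label, e.g. "small" or "large"
    depends_on: tuple[str, ...] = ()  # roles whose output this agent consumes


def execution_order(agents: list[AgentSpec]) -> list[str]:
    """Return a run order in which every agent follows its dependencies.

    Raises ValueError for a dependency on an unknown role, and
    graphlib.CycleError if the dependency structure contains a cycle.
    """
    names = {a.role for a in agents}
    for a in agents:
        missing = set(a.depends_on) - names
        if missing:
            raise ValueError(f"{a.role} depends on unknown roles: {sorted(missing)}")
    graph = {a.role: set(a.depends_on) for a in agents}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical three-agent specification for a math word problem.
spec = [
    AgentSpec("verifier", "Check the final answer.", "small", ("solver",)),
    AgentSpec("planner", "Break the problem into steps.", "small"),
    AgentSpec("solver", "Carry out the plan.", "large", ("planner",)),
]

print(execution_order(spec))
```

Keeping roles, capacity, and dependencies in one structure is what lets a single object be validated and run directly, which is the sense of "executable" assumed in this sketch.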

📝 Abstract

Large language models (LLMs) have become a strong foundation for multi-agent systems, but their effectiveness depends heavily on orchestration design. Across different tasks, role design, capacity assignment, and dependency construction jointly affect both solution quality and execution efficiency. Existing approaches automate parts of this design process, yet they often optimize these decisions partially or sequentially, and rely on execution-level feedback that provides limited credit assignment for local orchestration decisions. We propose LEMON (Learning Executable Multi-agent OrchestratioN via Counterfactual Reinforcement Learning), an LLM-based orchestrator that generates an executable orchestration specification. The specification integrates task-specific roles, customized duties, capacity levels, and dependency structure into a single deployable system. To train the orchestrator, we augment the orchestration-level GRPO objective with a localized counterfactual signal that edits role, capacity, or dependency fields and applies the resulting reward contrast only to the edited spans. Experiments on six reasoning and coding benchmarks, including MMLU, GSM8K, AQuA, MultiArith, SVAMP, and HumanEval, show that LEMON achieves state-of-the-art performance among the evaluated multi-agent orchestration methods. Our code is available at https://anonymous.4open.science/r/LEMON-B23C.
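The training signal described in the abstract can be illustrated with a small Python sketch. This is not the paper's objective, whose exact form is not given here. It assumes the standard GRPO group-relative advantage (reward minus group mean, divided by group standard deviation) and then, as a further assumption, adds the reward contrast between a sampled specification and its counterfactually edited copy only to the tokens of the edited span. The function names, the `weight` parameter, and the sign convention are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(
    num_tokens: int,
    base_advantage: float,
    edited_span: tuple[int, int],
    reward: float,
    counterfactual_reward: float,
    weight: float = 1.0,
) -> list[float]:
    """Per-token advantages for one sampled specification.

    Every token receives the sequence-level advantage. Tokens inside
    edited_span (start inclusive, end exclusive) additionally receive the
    contrast between the original reward and the reward obtained after
    editing that field. A positive contrast means the original field
    outperformed the edit (assumed convention).
    """
    start, end = edited_span
    contrast = reward - counterfactual_reward
    advantages = [base_advantage] * num_tokens
    for t in range(start, end):
        advantages[t] += weight * contrast
    return advantages


# Hypothetical example: a 5-token specification whose tokens 1..2 encode a
# capacity field; editing that field drops the reward from 1.0 to 0.25.
print(token_advantages(5, 0.0, (1, 3), reward=1.0, counterfactual_reward=0.25))
```

Restricting the contrast to the edited span is what localizes credit in this sketch: tokens outside the span see only the group-level signal, so a change in reward caused by one field is not attributed to the rest of the specification.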

Problem

Research questions and friction points this paper is trying to address.

multi-agent orchestration

role design

capacity assignment

dependency construction

credit assignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Reinforcement Learning

Multi-Agent Orchestration

Executable Specification