🤖 AI Summary
Existing LLM-based multi-agent systems suffer from capability imbalance and inefficient collaboration because their individual agents are fine-tuned in isolation. To address this, we propose MOAT, a novel framework that, for the first time, enables joint alignment and co-optimization of planning agents and grounding (execution) agents. MOAT employs alternating optimization with phased alignment, coupled with subgoal sequence generation and a self-constructing mechanism for diverse subgoal–action pairs. Theoretical analysis establishes that training progress is non-decreasing and converges asymptotically, while empirical evaluation across six benchmarks demonstrates MOAT's superiority over state-of-the-art methods: it achieves average improvements of 3.1% on in-distribution tasks and 4.4% on out-of-distribution tasks. MOAT thus offers a formally grounded, scalable paradigm for cooperative multi-agent optimization.
📝 Abstract
The advancement of large language models (LLMs) has enabled the construction of multi-agent systems that solve complex tasks by dividing responsibilities among specialized agents, such as a planning agent for subgoal generation and a grounding agent for executing tool-use actions. Most existing methods fine-tune these agents independently, leading to capability gaps among them and poor coordination. To address this, we propose MOAT, a Multi-Agent Joint Alignment Tuning framework that improves agent collaboration through iterative alignment. MOAT alternates between two key stages: (1) Planning Agent Alignment, which optimizes the planning agent to generate subgoal sequences that better guide the grounding agent; and (2) Grounding Agent Improving, which fine-tunes the grounding agent on diverse subgoal-action pairs generated by the agent itself to enhance its generalization capability. Theoretical analysis proves that MOAT ensures a non-decreasing and progressively convergent training process. Experiments across six benchmarks demonstrate that MOAT outperforms state-of-the-art baselines, achieving average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks.
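The alternating two-stage loop described in the abstract can be sketched as follows. This is a toy illustration, not the authors' implementation: the agents are reduced to scalar "capability" values, and `align_planner`, `improve_grounder`, and `joint_score` are hypothetical stand-ins for the LLM fine-tuning steps and task success rate. The sketch only shows the control flow and the non-decreasing progress property the paper claims.

```python
# Toy sketch of MOAT's alternating optimization (hypothetical names throughout).
# Each agent is modeled as a scalar capability in [0, 1]; real MOAT fine-tunes
# LLMs at each stage instead.

def joint_score(planner: float, grounder: float) -> float:
    """Stand-in for end-to-end task success; higher is better, capped at 1."""
    return min(1.0, 0.5 * planner + 0.5 * grounder)

def align_planner(planner: float, grounder: float) -> float:
    """Stage 1 (Planning Agent Alignment): improve subgoal sequences
    so they better guide the current grounding agent."""
    return min(1.0, planner + 0.1 * (1.0 - joint_score(planner, grounder)))

def improve_grounder(planner: float, grounder: float) -> float:
    """Stage 2 (Grounding Agent Improving): fine-tune the grounder on
    self-constructed subgoal-action pairs."""
    return min(1.0, grounder + 0.1 * (1.0 - joint_score(planner, grounder)))

def moat_training(planner: float = 0.2, grounder: float = 0.2, rounds: int = 10):
    """Alternate the two stages and record the joint score after each round."""
    scores = [joint_score(planner, grounder)]
    for _ in range(rounds):
        planner = align_planner(planner, grounder)      # stage 1
        grounder = improve_grounder(planner, grounder)  # stage 2
        scores.append(joint_score(planner, grounder))
    return scores

scores = moat_training()
# In this toy objective each stage can only raise the joint score,
# mirroring the paper's non-decreasing training guarantee.
assert all(b >= a for a, b in zip(scores, scores[1:]))
```

Because each stage holds one agent fixed while improving the other against the shared objective, the joint score cannot decrease, which is the intuition behind the convergence result stated in the abstract.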