🤖 AI Summary
This work addresses the limitations of existing LLM-based multi-agent systems (MAS), which rely either on manual design that generalizes poorly or on automated approaches hindered by code generation failures and rigid templates. The authors propose EvoMAS, which formulates MAS generation as an evolutionary process within a structured configuration space. Execution-trace feedback guides the mutation and crossover operations, and the candidate pool is refined iteratively together with an experience memory, so that generated systems remain executable, robust, and architecturally expressive. Evaluated on BBEH, SWE-Bench, and WorkBench, the method outperforms both handcrafted designs and prior automated techniques: it improves on EvoAgent by 10.5 points on BBEH and 7.1 points on WorkBench, and reaches a 79.1% pass rate on SWE-Bench-Verified with Claude-4.5-Sonnet.
📝 Abstract
Large language model (LLM)-based multi-agent systems (MAS) show strong promise for complex reasoning, planning, and tool-augmented tasks, but designing effective MAS architectures remains labor-intensive, brittle, and hard to generalize. Existing automatic MAS generation methods either rely on code generation, which often leads to executability and robustness failures, or impose rigid architectural templates that limit expressiveness and adaptability. We propose Evolutionary Generation of Multi-Agent Systems (EvoMAS), which formulates MAS generation as structured configuration generation and performs the evolutionary search directly in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedback-conditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. We evaluate EvoMAS on diverse benchmarks, including BBEH, SWE-Bench, and WorkBench, covering reasoning, software engineering, and tool-use tasks. EvoMAS consistently improves task performance over both human-designed MAS and prior automatic MAS generation methods, while producing systems with higher executability and runtime robustness. EvoMAS outperforms the agent evolution method EvoAgent by +10.5 points on BBEH reasoning and +7.1 points on WorkBench. With Claude-4.5-Sonnet, EvoMAS also reaches 79.1% on SWE-Bench-Verified, matching the top of the leaderboard.
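The loop described in the abstract (select from a pool, mutate and cross over using execution feedback, refine the pool and an experience memory) can be illustrated with a minimal toy sketch. It is not the authors' implementation: the `Config` schema, the `execute` scoring stub, the trace fields, and the use of the memory as a simple evaluation cache are all assumptions made for illustration, and the real system would run LLM agents rather than a hand-written scoring rule.

```python
import random
from dataclasses import dataclass

# Hypothetical role vocabulary; the paper's configuration schema is not given here.
ROLES = ["planner", "coder", "tester", "reviewer"]


@dataclass(frozen=True)
class Config:
    """Toy structured MAS configuration (assumed schema)."""
    roles: tuple
    max_steps: int


def execute(cfg):
    """Stand-in for running a MAS on a task: returns (score, trace)."""
    missing = [r for r in ROLES if r not in cfg.roles]
    out_of_steps = cfg.max_steps < 4
    score = len(set(cfg.roles) & set(ROLES)) / len(ROLES)
    if out_of_steps:
        score *= 0.5  # penalize runs that exhaust their step budget
    return score, {"missing_roles": missing, "out_of_steps": out_of_steps}


def mutate(cfg, trace, rng):
    """Feedback-conditioned mutation: repair one problem the trace reports."""
    roles, steps = list(cfg.roles), cfg.max_steps
    if trace["missing_roles"]:
        roles.append(rng.choice(trace["missing_roles"]))
    elif trace["out_of_steps"]:
        steps += 2
    return Config(tuple(roles), steps)


def crossover(a, b):
    """Combine two parents: ordered union of roles, larger step budget."""
    return Config(tuple(dict.fromkeys(a.roles + b.roles)),
                  max(a.max_steps, b.max_steps))


def evolve(initial_pool, generations=6, pool_size=4, seed=0):
    rng = random.Random(seed)
    memory = {}  # experience memory (assumed form): config -> (score, trace)

    def assess(cfg):
        # Reuse stored experience instead of re-executing a known config.
        if cfg not in memory:
            memory[cfg] = execute(cfg)
        return memory[cfg]

    pool = list(initial_pool)
    for _ in range(generations):
        ranked = sorted(pool, key=lambda c: assess(c)[0], reverse=True)
        parents = ranked[:2]
        children = [mutate(p, assess(p)[1], rng) for p in parents]
        if len(parents) == 2:
            children.append(crossover(parents[0], parents[1]))
        # Deduplicate, then keep only the best candidates for the next round.
        candidates = list(dict.fromkeys(ranked + children))
        pool = sorted(candidates, key=lambda c: assess(c)[0],
                      reverse=True)[:pool_size]
    return pool[0], memory


best, memory = evolve([Config(("planner",), 2), Config(("coder",), 6)])
print(best, len(memory))
```

Because every candidate is a structured configuration rather than generated code, each child is valid by construction, which is the property the abstract contrasts with code-generation-based methods.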