🤖 AI Summary
This work examines three key challenges in meta-agent automated design: low cross-iteration learning efficiency, limited behavioral diversity among designed agents, and design costs that often exceed the resulting performance gains. To improve learning across iterations, the authors propose an evolutionary in-context learning strategy that selectively retains high-value historical agents instead of expanding the context with all previous designs, accelerating convergence and boosting average agent performance by 23.6%. They also quantify when automated design is economically viable: only on two benchmark datasets does the total cost of designing and deploying agents fall below that of human-designed agents, and only at deployment scales above 15,000 examples; on the remaining datasets, the performance gains do not justify the design cost at any scale. Finally, the designed agents exhibit low behavioral diversity, which limits their complementary use at test time. Together, these findings offer a systematic analytical framework for balancing efficiency, diversity, and cost-effectiveness in meta-agent systems.
📝 Abstract
Recent work has begun to automate the design of agentic systems using meta-agents that propose and iteratively refine new agent architectures. In this paper, we examine three key challenges in a common class of meta-agents. First, we investigate how a meta-agent learns across iterations and find that simply expanding the context with all previously designed agents, as proposed in prior work, performs worse than ignoring prior designs entirely. We show that performance improves with an evolutionary approach. Second, although the meta-agent designs multiple agents during training, it typically commits to a single agent at test time. We find that the designed agents exhibit low behavioral diversity, limiting the potential for their complementary use. Third, we assess when automated design is economically viable. We find that only in a few cases (specifically, two datasets) is the overall cost of designing and deploying the agents lower than that of human-designed agents, and only when deployed on over 15,000 examples. For the other datasets, the performance gains do not justify the design cost, regardless of scale.