🤖 AI Summary
Context: Existing self-evolving agents are largely confined to prompt rewriting or failure retrying, and thus fail to make the substantive transition from general-purpose agents to high-precision domain experts.
Method: This paper introduces a self-evolving framework grounded in the Model Context Protocol (MCP), systematically supporting the generation, abstraction, and reuse of domain expertise. It integrates memory augmentation, tool invocation, multi-source feedback, and retrieval-augmented MCP selection, coupled with a lightweight MCP executor to optimize inference paths.
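The retrieval-augmented MCP selection step can be illustrated with a minimal sketch. This is not the paper's implementation: the `select_mcps` function, the MCP Box entry format, and the bag-of-words cosine similarity are all illustrative assumptions standing in for whatever retriever and schema the authors actually use.

```python
from collections import Counter
from math import sqrt

def _vec(text: str) -> Counter:
    # Bag-of-words term counts; a real system would likely use dense embeddings.
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_mcps(task: str, mcp_box: list[dict], k: int = 2) -> list[dict]:
    """Rank MCPs by similarity between the task and each tool's
    description plus recorded use cases, then keep the top k."""
    q = _vec(task)
    scored = [
        (_cosine(q, _vec(m["description"] + " " + " ".join(m["use_cases"]))), m)
        for m in mcp_box
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [m for _, m in scored[:k]]

# Hypothetical MCP Box entries, for illustration only.
mcp_box = [
    {"name": "pdf_table_extract",
     "description": "extract tables from pdf documents",
     "use_cases": ["parse financial pdf report tables"]},
    {"name": "web_search",
     "description": "search the web for factual answers",
     "use_cases": ["look up a recent news fact online"]},
]

selected = select_mcps("extract the revenue table from this pdf report", mcp_box, k=1)
print(selected[0]["name"])  # -> pdf_table_extract
```

The selected tools would then be handed to the lightweight MCP executor, so the agent only carries the few primitives relevant to the current query rather than the entire box.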
Contribution/Results: Evaluated on the GAIA benchmark, our approach achieves 83.03% pass@1 and 89.09% pass@3, surpassing prior methods while reducing computational cost by 15%. To our knowledge, this is the first framework enabling reusable and evolvable domain expert construction, establishing a novel paradigm for scalable, adaptive agent specialization.
📄 Abstract
Large language models (LLMs) have been shown to perform better when scaffolded into agents with memory, tools, and feedback. Beyond this, self-evolving agents have emerged, but current work largely limits adaptation to prompt rewriting or failure retries. We therefore present ALITA-G, a self-evolution framework that transforms a general-purpose agent into a domain expert by systematically generating, abstracting, and curating Model Context Protocol (MCP) tools. In this framework, a generalist agent executes a curated suite of target-domain tasks and synthesizes candidate MCPs from successful trajectories. These are then abstracted into parameterized primitives and consolidated into an MCP Box. At inference time, ALITA-G performs retrieval-augmented MCP selection, guided by each tool's description and use cases, before executing an agent equipped with the MCP Executor. Across three benchmarks (GAIA, PathVQA, and Humanity's Last Exam), ALITA-G attains strong gains while reducing computation costs. On GAIA validation, it achieves 83.03% pass@1 and 89.09% pass@3, establishing a new state-of-the-art result while reducing mean tokens per example by approximately 15% relative to a strong baseline agent. ALITA-G thus provides a principled pathway from generalist capability to reusable, domain-specific competence, improving both accuracy and efficiency on complex reasoning tasks.
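The abstraction step, lifting a concrete successful trajectory step into a parameterized primitive for the MCP Box, can be sketched as follows. The `MCPPrimitive` dataclass, the trajectory record format, and the literal-to-parameter substitution are illustrative assumptions, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class MCPPrimitive:
    """A parameterized tool primitive for the MCP Box (field names are illustrative)."""
    name: str
    description: str
    parameters: dict                      # parameter name -> type name
    use_cases: list = field(default_factory=list)

def abstract_step(step: dict) -> MCPPrimitive:
    """Lift one concrete trajectory step into a reusable primitive by
    replacing its task-specific argument values with typed parameters."""
    return MCPPrimitive(
        name=step["action"],
        description=step["intent"],
        parameters={k: type(v).__name__ for k, v in step["args"].items()},
        use_cases=[step["task"]],
    )

# A concrete successful step, in a hypothetical trajectory-record format.
step = {"action": "download_csv",
        "intent": "fetch a csv file from a url",
        "args": {"url": "https://example.com/data.csv"},
        "task": "summarize quarterly sales data"}

prim = abstract_step(step)
print(prim.name, prim.parameters)  # -> download_csv {'url': 'str'}
```

Consolidating many such primitives, and merging duplicates that arise from different tasks, yields the MCP Box that the retrieval step searches over at inference time.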