AI Summary
Existing large language model agents tackling multi-step professional tasks rely on handcrafted skill packages, which incur high annotation costs and risk misalignment between human-designed skills and agent cognition. This work proposes EvoSkills, a framework that, for the first time, achieves autonomous co-evolution at the skill level. Through iterative interaction between a skill generator and a surrogate verifier, EvoSkills automatically produces and refines structured, multi-file skill packages without requiring real-world test data. This approach transcends the limitations of conventional tool-level self-evolution paradigms and significantly outperforms existing methods on the SkillsBench benchmark, achieving state-of-the-art or near state-of-the-art pass rates across six mainstream large language models, including Claude and Codex, and demonstrating strong generalization capability.
Abstract
Anthropic proposes the concept of skills for LLM agents to tackle multi-step professional tasks that simple tool invocations cannot address. A tool is a single, self-contained function, whereas a skill is a structured bundle of interdependent multi-file artifacts. Currently, skill generation is not only labor-intensive due to manual authoring but may also suffer from human-machine cognitive misalignment, which can degrade agent performance, as evidenced by evaluations on SkillsBench. We therefore aim to enable agents to autonomously generate skills. However, existing self-evolving methods designed for tools cannot be directly applied to skills because of their increased complexity. To address these issues, we propose EvoSkills, a self-evolving skills framework that enables agents to autonomously construct complex, multi-file skill packages. Specifically, EvoSkills couples a Skill Generator that iteratively refines skills with a Surrogate Verifier that co-evolves to provide informative and actionable feedback without access to ground-truth test content. On SkillsBench, EvoSkills achieves the highest pass rate among five baselines on both Claude Code and Codex, and also generalizes well to six additional LLMs.
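The generator-verifier loop described in the abstract might be sketched as follows. This is a minimal illustrative sketch only: every class name, method, and check here is an assumption, not the authors' API, and in the real system both roles would be driven by LLM calls rather than the deterministic stubs shown.

```python
from dataclasses import dataclass, field


@dataclass
class SkillPackage:
    """A multi-file skill bundle: file path -> file content."""
    files: dict = field(default_factory=dict)
    revision: int = 0


class SkillGenerator:
    """Drafts and refines skill packages (stub for an LLM-driven generator)."""

    def propose(self, task: str) -> SkillPackage:
        # Initial draft; deliberately minimal so refinement has work to do.
        return SkillPackage(files={"SKILL.md": f"# Skill for: {task}"})

    def refine(self, pkg: SkillPackage, feedback: list) -> SkillPackage:
        # Feedback here is simply a list of missing file paths.
        files = dict(pkg.files)
        for path in feedback:
            files[path] = f"# TODO: drafted on revision {pkg.revision + 1}"
        return SkillPackage(files=files, revision=pkg.revision + 1)


class SurrogateVerifier:
    """Checks package structure only; no ground-truth task tests are used."""

    def __init__(self):
        self.required = {"SKILL.md"}  # check set grows as the verifier co-evolves

    def update(self, pkg: SkillPackage) -> None:
        # Co-evolution stub: after inspecting a package, also demand an
        # entry script (a hypothetical check, for illustration only).
        self.required.add("run.py")

    def verify(self, pkg: SkillPackage) -> tuple:
        missing = sorted(self.required - pkg.files.keys())
        return (not missing, missing)


def evolve_skill(task, gen, ver, max_rounds=3):
    """Alternate verifier co-evolution and skill refinement until the
    surrogate verifier accepts the package or the round budget runs out."""
    pkg = gen.propose(task)
    for _ in range(max_rounds):
        ver.update(pkg)                      # verifier co-evolves its checks
        ok, feedback = ver.verify(pkg)
        if ok:
            break
        pkg = gen.refine(pkg, feedback)      # generator acts on the feedback
    return pkg
```

With these stubs, `evolve_skill("parse invoices", SkillGenerator(), SurrogateVerifier())` converges in one refinement round, returning a package containing both `SKILL.md` and the verifier-demanded `run.py`.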