EvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification

πŸ“… 2026-04-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing large language model agents tackling multi-step professional tasks rely on handcrafted skill packages, suffering from high annotation costs and misalignment between human-designed skills and agent cognition. This work proposes EvoSkills, a framework that achieves autonomous co-evolution at the skill level for the first time. Through iterative interaction between a skill generator and an agent validator, EvoSkills automatically produces and refines structured, multi-file skill packages without requiring real-world test data. This approach transcends the limitations of conventional tool-level self-evolution paradigms, significantly outperforming existing methods on the SkillsBench benchmark. It achieves state-of-the-art or near state-of-the-art pass rates on Claude Code and Codex, and generalizes to six additional mainstream large language models, demonstrating strong generalization capability.
πŸ“ Abstract
Anthropic proposes the concept of skills for LLM agents to tackle multi-step professional tasks that simple tool invocations cannot address. A tool is a single, self-contained function, whereas a skill is a structured bundle of interdependent multi-file artifacts. Currently, skill generation is not only labor-intensive due to manual authoring, but also may suffer from human–machine cognitive misalignment, which can lead to degraded agent performance, as evidenced by evaluations on SkillsBench. Therefore, we aim to enable agents to autonomously generate skills. However, existing self-evolving methods designed for tools cannot be directly applied to skills due to their increased complexity. To address these issues, we propose EvoSkills, a self-evolving skills framework that enables agents to autonomously construct complex, multi-file skill packages. Specifically, EvoSkills couples a Skill Generator that iteratively refines skills with a Surrogate Verifier that co-evolves to provide informative and actionable feedback without access to ground-truth test content. On SkillsBench, EvoSkills achieves the highest pass rate among five baselines on both Claude Code and Codex, and also exhibits strong generalization capabilities to six additional LLMs.
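The abstract describes a loop in which a Skill Generator refines a multi-file skill package and a Surrogate Verifier returns actionable feedback without ground-truth tests. A minimal sketch of that generator–verifier co-evolution loop is below; all class and function names are hypothetical (not from the paper), and the two stub functions stand in for LLM calls:

```python
# Hypothetical sketch of the generator-verifier co-evolution loop from the
# abstract. All names are illustrative; the real system uses LLM calls where
# the stubs below return canned content.
from dataclasses import dataclass, field

@dataclass
class Skill:
    """A multi-file skill package: file path -> file content."""
    files: dict = field(default_factory=dict)
    version: int = 0

def generate_skill(task: str, feedback: str, prev: Skill) -> Skill:
    """Skill Generator stub: revise the skill package using verifier feedback."""
    files = dict(prev.files)
    files["SKILL.md"] = f"# Skill for: {task}\n# Revision notes: {feedback}"
    return Skill(files=files, version=prev.version + 1)

def verify_skill(task: str, skill: Skill) -> tuple[bool, str]:
    """Surrogate Verifier stub: judge the skill without ground-truth tests and
    return (passed, actionable feedback)."""
    ok = skill.version >= 3  # toy acceptance criterion for illustration
    feedback = "looks complete" if ok else "add error handling and usage examples"
    return ok, feedback

def co_evolve(task: str, max_rounds: int = 5) -> Skill:
    """Alternate generation and verification until the verifier accepts."""
    skill, feedback = Skill(), "initial draft"
    for _ in range(max_rounds):
        skill = generate_skill(task, feedback, skill)
        passed, feedback = verify_skill(task, skill)
        if passed:
            break
    return skill

skill = co_evolve("extract tables from PDFs")
print(skill.version)  # number of refinement rounds used -> 3
```

In the paper the verifier also co-evolves (its feedback criteria improve over iterations), which this fixed stub does not capture; the sketch only shows the control flow of the refinement loop.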
Problem

Research questions and friction points this paper is trying to address.

LLM agents
skill generation
self-evolving
cognitive misalignment
multi-step tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolving
co-evolutionary verification
LLM agent skills
multi-file artifacts
surrogate verifier