SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks do not systematically evaluate whether agents can autonomously discover, repair, and continuously evolve skills. This work introduces SkillFlow, a benchmark of 166 tasks across 20 task families, built on the Domain-Agnostic Execution Flow (DAEF) framework. Under its lifelong learning protocol, agents start without skills and iteratively generate skill patches, driven by execution trajectories and rubric scores, to update their skill library. The study establishes the first lifelong evaluation paradigm for autonomous skill discovery and evolution, and reveals a notable disconnect between how often skills are used and how much they help. On Claude Opus 4.6, lifelong evolution improves task success from 62.65% to 71.08%. High usage does not guarantee gains, however: Kimi K2.5 improves by only +0.60 points, and Qwen-Coder-Next regresses relative to the vanilla setting.
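The reported gains are percentage-point differences, not relative percentages. As a rough illustration, the sketch below converts the reported success rates back into whole-task counts. It assumes the rates are computed over all 166 tasks, which the summary does not state explicitly; the helper names are hypothetical.

```python
TOTAL_TASKS = 166  # benchmark size stated in the summary


def implied_task_count(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks matching a reported percentage.

    Assumption: the percentage was computed over `total` tasks.
    """
    return round(rate_percent / 100 * total)


def success_rate_percent(count: int, total: int = TOTAL_TASKS) -> float:
    """Success rate in percent, rounded to two decimals as in the summary."""
    return round(100 * count / total, 2)


vanilla_count = implied_task_count(62.65)    # rate without skill evolution
lifelong_count = implied_task_count(71.08)   # rate with lifelong skill evolution

# Gain expressed in percentage points (a difference of two rates).
gain_points = round(
    success_rate_percent(lifelong_count) - success_rate_percent(vanilla_count), 2
)

print(vanilla_count, lifelong_count, gain_points)
```

Under that assumption, the two rates correspond to 104 and 118 solved tasks, so the +8.43-point gain amounts to 14 additional tasks.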

📝 Abstract

As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families. Within each family, tasks are constructed from a Domain-Agnostic Execution Flow (DAEF), an agent workflow framework, so that they share a consistent workflow. Agents are evaluated under an Agentic Lifelong Learning protocol in which they begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only +0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and still regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
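The Agentic Lifelong Learning protocol described above can be sketched as a simple loop over one task family. This is a minimal illustration, not the benchmark's actual harness: `SkillLibrary`, `run_family`, the `solve` and `propose_patch` callables, and the `pass_threshold` parameter are all hypothetical names, and representing a skill as a name-to-text mapping is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SkillLibrary:
    """Hypothetical skill store: skill name -> skill text."""
    skills: Dict[str, str] = field(default_factory=dict)

    def apply_patch(self, patch: Dict[str, str]) -> None:
        # A patch adds new skills or overwrites (repairs) existing ones.
        self.skills.update(patch)


# Assumed interfaces, for illustration only.
# solve(task, skills) -> (trajectory, rubric_score)
SolveFn = Callable[[str, Dict[str, str]], Tuple[str, float]]
# propose_patch(trajectory, rubric_score, skills) -> skill patch
PatchFn = Callable[[str, float, Dict[str, str]], Dict[str, str]]


def run_family(
    tasks: List[str],
    solve: SolveFn,
    propose_patch: PatchFn,
    pass_threshold: float = 1.0,  # assumed success criterion
) -> Tuple[float, SkillLibrary]:
    """Run one task family sequentially, evolving the library after each task."""
    library = SkillLibrary()  # the agent begins without skills
    successes = 0
    for task in tasks:
        # Attempt the task with the skills accumulated so far.
        trajectory, score = solve(task, dict(library.skills))
        if score >= pass_threshold:
            successes += 1
        # Externalize lessons as a trajectory- and rubric-driven patch,
        # then carry the updated library forward to the next task.
        patch = propose_patch(trajectory, score, dict(library.skills))
        library.apply_patch(patch)
    success_rate = successes / len(tasks) if tasks else 0.0
    return success_rate, library
```

Passing `solve` and `propose_patch` as callables keeps the loop independent of any particular model or agent framework. Whether the library is reset between families is not specified in the abstract; this sketch covers a single family only.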

Problem

Research questions and friction points this paper is trying to address.

lifelong learning

skill discovery

autonomous agents

skill evolution

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

lifelong skill discovery

Domain-Agnostic Execution Flow

skill patching