🤖 AI Summary
Existing benchmarks struggle to evaluate agents’ ability to abstract and reuse high-level tool compositions (referred to as “skills”) over long-horizon tasks. To address this gap, this work proposes SkillCraft, a benchmark that makes skill formation and cross-task reuse its core evaluation dimensions. SkillCraft features highly compositional, scalable, real-world tool-use scenarios, accompanied by a lightweight evaluation protocol. The protocol lets large language model agents automatically compose atomic tools into executable skills, cache them both within and across tasks, and build a persistent skill library. Experiments show that advanced agents reduce token consumption by up to 80% through skill saving and reuse, and that task success rates correlate strongly with agents' tool-composition ability at test time.
📝 Abstract
Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting and reusing higher-level tool compositions. However, existing benchmarks mainly measure instance-level success under static tool sets, offering limited insight into agents' ability to acquire such reusable skills. We address this gap by introducing SkillCraft, a benchmark that explicitly stress-tests agents' ability to form and reuse higher-level tool compositions, which we call Skills. SkillCraft features realistic, highly compositional tool-use scenarios with difficulty scaled along both quantitative and structural dimensions, designed to elicit skill abstraction and cross-task reuse. We further propose a lightweight evaluation protocol that enables agents to auto-compose atomic tools into executable Skills, cache them, and reuse them within and across tasks, thereby improving efficiency while accumulating a persistent library of reusable skills. Evaluating state-of-the-art agents on SkillCraft, we observe substantial efficiency gains, with token usage reduced by up to 80% through skill saving and reuse. Moreover, success rate strongly correlates with tool composition ability at test time, underscoring compositional skill acquisition as a core capability.
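To make the idea of composing atomic tools into a cached, reusable Skill concrete, the following is a minimal Python sketch. It is not SkillCraft's actual protocol or API: the `SkillLibrary` class, its `save` and `run` methods, and the three toy tools are all hypothetical names chosen for illustration, and the library here lives only in memory rather than being truly persistent.

```python
from typing import Any, Callable, Dict, List

# An "atomic tool" is modeled here as a plain one-argument function (assumption).
Tool = Callable[[Any], Any]


class SkillLibrary:
    """Hypothetical in-memory store of named skills (not SkillCraft's real API)."""

    def __init__(self) -> None:
        self._skills: Dict[str, Callable[[Any], Any]] = {}
        self.call_count = 0  # number of times any saved skill has been run

    def save(self, name: str, tools: List[Tool]) -> None:
        """Compose atomic tools, applied left to right, into one named skill."""
        pipeline = tuple(tools)  # freeze the sequence so later list edits don't leak in

        def skill(value: Any) -> Any:
            for tool in pipeline:
                value = tool(value)
            return value

        self._skills[name] = skill

    def has(self, name: str) -> bool:
        return name in self._skills

    def run(self, name: str, value: Any) -> Any:
        """Run a saved skill with a single call instead of one call per tool."""
        self.call_count += 1
        return self._skills[name](value)


# Three toy atomic tools (illustrative only).
def strip_text(s: str) -> str:
    return s.strip()


def to_lower(s: str) -> str:
    return s.lower()


def split_words(s: str) -> List[str]:
    return s.split()


library = SkillLibrary()
library.save("normalize_and_tokenize", [strip_text, to_lower, split_words])

# The same skill is reused across two different "tasks".
tokens_a = library.run("normalize_and_tokenize", "  Hello World ")
tokens_b = library.run("normalize_and_tokenize", "SkillCraft Benchmark")
```

In this sketch each reuse replaces three separate tool invocations with a single skill call, which is the kind of saving the abstract attributes to skill caching; the reported 80% token reduction comes from the paper's experiments, not from this toy example.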