SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

📅 2026-02-13
📈 Citations: 0 · Influential: 0

📝 Abstract
Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise the average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (from +4.5pp for Software Engineering to +51.9pp for Healthcare), and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2--3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.
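The scoring described in the abstract, pass rates from deterministic verifiers under each condition, compared against a no-Skills baseline in percentage points, can be sketched as follows. All names and data here are hypothetical illustrations, not the paper's actual harness.

```python
# Sketch of SkillsBench-style scoring: per-(task, condition) pass rates
# from deterministic verifier outcomes, and percentage-point deltas
# against the no-Skills baseline. Data is illustrative only.
from collections import defaultdict

CONDITIONS = ("no_skills", "curated_skills", "self_generated_skills")

def pass_rates(trials):
    """trials: iterable of (task_id, condition, passed) verifier outcomes."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for task, cond, ok in trials:
        total[(task, cond)] += 1
        passed[(task, cond)] += int(bool(ok))
    return {key: passed[key] / total[key] for key in total}

def delta_pp(rates, task, cond):
    """Percentage-point change for one task vs its no-Skills baseline."""
    return 100.0 * (rates[(task, cond)] - rates[(task, "no_skills")])

# Hypothetical trajectories for one task under two conditions.
trials = [
    ("healthcare_01", "no_skills", False),
    ("healthcare_01", "no_skills", True),
    ("healthcare_01", "curated_skills", True),
    ("healthcare_01", "curated_skills", True),
]
rates = pass_rates(trials)
print(delta_pp(rates, "healthcare_01", "curated_skills"))  # 50.0
```

Averaging such per-task deltas over all tasks and configurations would yield aggregate figures like the +16.2pp reported for curated Skills; a negative delta on a task means the Skill hurt that task.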
Problem

Research questions and friction points this paper is trying to address.

Tags: Agent Skills, benchmarking, LLM agents, procedural knowledge, task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tags: SkillsBench, Agent Skills, LLM agents, Skill evaluation, Procedural knowledge
Authors

Xiangyi Li, BenchFlow
Wenbo Chen, Amazon
Yimin Liu, Ohio State University
Shenghan Zheng, Dartmouth College
Xiaokun Chen, Stanford University
Yifeng He, UC Davis
Yubo Li, Carnegie Mellon University (AI in Healthcare, Large Language Models, AI Alignment)
Bingran You, UC Berkeley
Haotian Shen, Hybrid Systems Lab, UC Berkeley (control theory)
Jiankai Sun, Independent
Shuyi Wang, Independent
Qunhong Zeng, Beijing Institute of Technology
Di Wang, Foxconn
Xuandong Zhao, UC Berkeley (Machine Learning, Natural Language Processing, AI Safety)
Yuanli Wang, Boston University (Distributed Systems, MLSys, Large Language Models, Agentic AI)
Roey Ben Chaim, Zenity
Zonglin Di, UC Santa Cruz
Yipeng Gao, University of Southern California (Generative AI, Computer Vision)
Junwei He, Institute of Computing Technology, Chinese Academy of Sciences (LLM Reasoning, Graph Learning)
Yizhuo He, Carnegie Mellon University
Liqiang Jing, University of Texas at Dallas (Multimedia Analysis, Multimodal, Natural Language Processing)
Luyang Kong, Independent
Xin Lan, Michigan State University
Jiachen Li, UT Austin
Songlin Li, Stanford University