SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?

πŸ“… 2026-03-16
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the lack of systematic evaluation of agent skills in real-world software engineering by introducing SWE-Skills-Bench, the first requirement-driven benchmark for this setting. It pairs 49 publicly available software engineering skills with real GitHub repositories and explicit acceptance criteria, constructing a controlled evaluation framework that isolates the marginal utility of each individual skill. The benchmark incorporates a deterministic verification mechanism based on executable tests. Experiments show that of the 49 evaluated skills, only 7 significantly improve task success rates (by up to 30%), while 39 show no measurable benefit, for an average performance gain of just +1.2%. Notably, certain skills even degrade performance due to context mismatch, underscoring the need for rigorous, task-aligned skill assessment.

πŸ“ Abstract
Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end-to-end development settings remains unclear. We present SWE-Skills-Bench, the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). It pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. We introduce a deterministic verification framework that maps each task's acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. Our results show that skill injection benefits are far more limited than rapid adoption suggests: 39 of 49 skills yield zero pass-rate improvement, and the average gain is only +1.2%. Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. Only seven specialized skills produce meaningful gains (up to +30%), while three degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. SWE-Skills-Bench is available at https://github.com/GeniusHTX/SWE-Skills-Bench.
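The abstract's core protocol is a controlled paired evaluation: each task is run twice on the same pinned repository commit, once without and once with the skill injected, and the skill's marginal utility is the difference in execution-test pass rates. A minimal sketch of that comparison is below; the `PairedResult` record and the example data are hypothetical stand-ins, not the benchmark's actual harness, which maps acceptance criteria to executable tests.

```python
# Hedged sketch of the paired with/without-skill evaluation described in the
# abstract. Data structures here are illustrative assumptions, not the
# SWE-Skills-Bench implementation.
from dataclasses import dataclass

@dataclass
class PairedResult:
    task_id: str
    passed_baseline: bool    # agent run without the skill injected
    passed_with_skill: bool  # same task, same pinned commit, skill injected

def pass_rate_delta(results: list[PairedResult]) -> float:
    """Marginal utility of one skill: pass rate with the skill minus baseline."""
    n = len(results)
    base = sum(r.passed_baseline for r in results) / n
    with_skill = sum(r.passed_with_skill for r in results) / n
    return with_skill - base

# Hypothetical results for one skill across four task instances.
results = [
    PairedResult("task-1", False, True),
    PairedResult("task-2", True, True),
    PairedResult("task-3", False, False),
    PairedResult("task-4", True, False),  # a skill can also degrade performance
]
print(f"{pass_rate_delta(results):+.2f}")  # prints +0.00: the gain is offset by a regression
```

Aggregating this delta per skill over its task instances is what yields headline numbers like "39 of 49 skills show zero pass-rate improvement"; because verification is deterministic executable tests, the same delta is reproducible across runs.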
Problem

Research questions and friction points this paper is trying to address.

agent skills
software engineering
LLM agents
skill utility
real-world tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

agent skills
software engineering benchmark
deterministic verification
LLM agents
skill injection