🤖 AI Summary
This study investigates why adding procedural knowledge (Skills) to tool-augmented agents fails to improve, and may even degrade, performance on offensive cybersecurity tasks. The authors re-analyze a controlled 180-run experiment on an MCP-grounded autonomous Capture-the-Flag (CTF) agent run under four documentation conditions of increasing richness, which they treat as a Skills ablation ranging from no Skills to comprehensive Skills. They identify “environment-feedback bandwidth” as the key moderator: when the tool layer returns strict, schema-validated, low-latency observations, the marginal benefit of Skills shrinks substantially and can turn negative. The spread between the no-Skills and full-Skills conditions is only 8.9 percentage points (chi-square test, p = 0.71; Cochran–Armitage trend test, p = 0.25), and five of six pairwise Cohen’s h effect sizes fall below the 0.2 small-effect threshold.
📝 Abstract
Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2~percentage points across diverse domains. Yet the same benchmarks show wide variance: 16 of 84 tasks suffer negative deltas when Skills are introduced. The community has not yet articulated a clean mechanism for \emph{when} Skills help and when they are merely redundant overhead. We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions of increasing richness (55, 1{,}478, 1{,}976, and 4{,}147 lines), and show that these conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation. In offensive cybersecurity, a domain not deeply covered by existing Skills benchmarks, the marginal benefit of Skills collapses. The spread between the no-Skills and full-Skills conditions is only 8.9~pp ($\chi^2$ test, $p = 0.71$; Cochran--Armitage trend test, $p = 0.25$), and five of six pairwise Cohen's $h$ values fall below the $0.2$ small-effect threshold. We argue that the missing variable is \emph{environment-feedback bandwidth}. When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. As a result, the marginal benefit of curated Skills diminishes substantially and, in some cases (e.g., our timing side-channel setting), Skills actively degrade performance. We articulate a falsifiable hypothesis, sketch its design implications for compound AI systems, and will release the reanalysis pipeline to support replication.
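
The effect-size and trend statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' reanalysis pipeline. The pass counts and the even 45-run split per condition are hypothetical placeholders chosen only for illustration, the function names are invented here, and the trend test uses the unadjusted variance form with equally spaced scores 0, 1, 2, 3 (other conventions exist). Four conditions yield six unordered pairs, which is why the abstract speaks of six pairwise Cohen's $h$ values.

```python
import math
from itertools import combinations

# Condition labels follow the abstract. The numbers below are HYPOTHETICAL
# placeholders for illustration only; they are not the study's data.
CONDITIONS = ["No-Skills", "Experiential-Skills", "Curated-Skills", "Comprehensive-Skills"]
RUNS = [45, 45, 45, 45]    # assumption: 180 runs split evenly across conditions
PASSES = [30, 32, 31, 34]  # hypothetical pass counts per condition


def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def pairwise_h(passes, runs):
    """Return {(i, j): h} for every unordered pair of conditions (j vs. i)."""
    rates = [r / n for r, n in zip(passes, runs)]
    return {
        (i, j): cohens_h(rates[j], rates[i])
        for i, j in combinations(range(len(rates)), 2)
    }


def cochran_armitage(passes, runs, scores=None):
    """Cochran-Armitage trend test; returns (z, two-sided p-value).

    Uses the unadjusted variance form and a normal approximation.
    """
    if scores is None:
        scores = list(range(len(runs)))  # equally spaced ordinal scores
    total_n = sum(runs)
    p_bar = sum(passes) / total_n  # pooled pass rate
    t_stat = sum(s * (r - n * p_bar) for s, r, n in zip(scores, passes, runs))
    sum_ns = sum(n * s for s, n in zip(scores, runs))
    sum_ns2 = sum(n * s * s for s, n in zip(scores, runs))
    variance = p_bar * (1 - p_bar) * (sum_ns2 - sum_ns ** 2 / total_n)
    if variance <= 0:
        raise ValueError("trend test undefined: no variation in outcomes or scores")
    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


if __name__ == "__main__":
    for (i, j), h in pairwise_h(PASSES, RUNS).items():
        label = "below" if abs(h) < 0.2 else "at or above"
        print(f"{CONDITIONS[i]} vs {CONDITIONS[j]}: h = {h:+.3f} ({label} 0.2)")
    z, p = cochran_armitage(PASSES, RUNS)
    print(f"Cochran-Armitage trend: z = {z:.3f}, p = {p:.3f}")
```

Because $h$ is computed on arcsine-transformed proportions, the same percentage-point gap gives a slightly different $h$ depending on where the two pass rates sit between 0 and 1, so a pp spread and an effect size are reported separately.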