When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

📅 2026-05-19

🤖 AI Summary

This study investigates why adding procedural knowledge (Skills) to tool-augmented agents fails to improve performance in offensive cybersecurity tasks and can even degrade it. The authors re-analyze a published 180-run controlled experiment on an MCP-grounded autonomous CTF agent, treating its four documentation-richness levels as a Skills ablation. They identify “environment-feedback bandwidth” as the critical moderator: when the tool layer returns strict, schema-validated, low-latency observations, the benefit of Skills shrinks substantially and in some cases turns negative. The spread between the no-Skills and full-Skills conditions is only 8.9 percentage points (p = 0.71, chi-square test; p = 0.25, Cochran–Armitage trend test), and five of six pairwise Cohen’s h values fall below the 0.2 small-effect threshold, indicating minimal or negative marginal utility of Skills in such tasks.

📝 Abstract

Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2 percentage points across diverse domains. Yet the same benchmarks show wide variance, with 16 of 84 tasks suffering negative deltas when Skills are introduced. The community has not yet articulated a clean mechanism for when Skills help and when they are merely redundant overhead. We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions of increasing richness (55, 1,478, 1,976, and 4,147 lines), and show that these conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation. In offensive cybersecurity, a domain not deeply covered by existing Skills benchmarks, the marginal benefit of Skills collapses. The spread between the no-Skills and full-Skills conditions is only 8.9 pp (p = 0.71, χ² test; p = 0.25, Cochran–Armitage trend test; five of six pairwise Cohen's h values fall below the 0.2 small-effect threshold). We argue that the missing variable is environment-feedback bandwidth. When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. As a result, the marginal benefit of curated Skills diminishes substantially, and, in some cases (e.g., our timing side-channel setting), actively degrades performance. We articulate a falsifiable hypothesis, sketch its design implications for compound AI systems, and will release the reanalysis pipeline to support replication.
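
To make the small-effect threshold concrete, the sketch below computes Cohen's h from its standard definition, h = 2·arcsin(√p1) − 2·arcsin(√p2). The function name `cohens_h` and the two pass rates are illustrative assumptions: this page does not report per-condition pass rates, so the values were chosen only to sit 8.9 pp apart and are not the study's data.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between two arcsine-transformed proportions."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


# Hypothetical pass rates 8.9 pp apart (assumed for illustration only,
# NOT the per-condition values from the study).
full_skills_rate = 0.589
no_skills_rate = 0.500

h = cohens_h(full_skills_rate, no_skills_rate)

# Conventional reading: |h| < 0.2 is below the small-effect threshold.
below_small_effect = abs(h) < 0.2
print(f"h = {h:.3f}, below small-effect threshold: {below_small_effect}")
```

Because the arcsine transform stretches the scale near 0 and 1, the same 8.9 pp gap yields a larger h when the pass rates sit close to either extreme, so the raw spread and the effect size should be read together.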

Problem

Research questions and friction points this paper is trying to address.

Skills

Tool-Grounded Agents

Offensive Cybersecurity

Procedural Knowledge

Environment-Feedback Bandwidth

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent Skills

Environment-Feedback Bandwidth

Tool-Grounded Agents