🤖 AI Summary
This work addresses the risk that skills invoked by large language model (LLM) agents violate their own declared safety specifications on benign inputs, with no adversarial attack involved, leading to hazardous outcomes such as unintended file deletion or credential leakage. The authors propose Sefz, a goal-directed semantic fuzzing framework that reduces specification-violation detection to a graph query over annotated execution traces. Sefz translates each natural-language guardrail into a reachability goal and uses an LLM-based mutator, guided by a multi-armed bandit with goal proximity as its reward, to generate benign inputs whose traces move toward a violation. Evaluated on 402 real-world agent skills, Sefz found specification violations in 120 of them (29.9%), including 26 previously unknown exploitable guardrail violations, and identified six recurring specification pitfalls that explain most of the failures.
📝 Abstract
LLM-powered agents can silently delete documents, leak credentials, or transfer funds on a routine user request, not because the agent was attacked, but because the skill it invoked broke its own declared safety rules. We call these specification violations: benign inputs cause a skill to breach the natural-language guardrails in its own specification, typically because the guardrail's semantics are undefined for autonomous execution, or because the implementation silently ignores the documented constraint. These violations are invisible to static analyzers, traditional fuzzers, and prompt-injection defenses alike, yet they undermine the very contract a user trusts when installing a skill.
We present Sefz, a goal-directed semantic fuzzing framework that automatically discovers specification violations in agent skills. Sefz translates each guardrail into a reachability goal over an annotated execution trace, reducing violation checking to a deterministic graph query. An LLM-based mutator generates benign inputs whose traces progressively approach the violation patterns, guided by a multi-armed bandit that uses goal-proximity as its reward signal.
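To make the two ideas above concrete, here is a minimal sketch, not Sefz's actual implementation. It assumes a trace is a small directed graph of labeled events and that a guardrail such as "never delete without confirmation" becomes the goal "a `delete_file` node is reachable from the request without passing through a `confirm` node". The function `violates`, the class `UCB1`, the event labels, and the mutation-operator names are all hypothetical. The abstract does not say which bandit algorithm is used; UCB1 is swapped in here purely for illustration.

```python
import math
from collections import deque


def violates(edges, labels, start, target_label, blocker_label):
    """Return True if a node labeled target_label is reachable from start
    along a path that never passes through a blocker_label node."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if labels[node] == blocker_label:
            continue  # paths through a blocker (e.g. a confirmation) are safe
        if labels[node] == target_label:
            return True
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


class UCB1:
    """Illustrative bandit over hypothetical mutation operators."""

    def __init__(self, arms):
        self.counts = {arm: 0 for arm in arms}
        self.totals = {arm: 0.0 for arm in arms}

    def select(self):
        # Try every arm once before applying the UCB1 score.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        t = sum(self.counts.values())
        return max(
            self.counts,
            key=lambda a: self.totals[a] / self.counts[a]
            + math.sqrt(2 * math.log(t) / self.counts[a]),
        )

    def update(self, arm, reward):
        # reward stands in for goal proximity, assumed to lie in [0, 1].
        self.counts[arm] += 1
        self.totals[arm] += reward


# Hypothetical traces: one deletes directly, one asks for confirmation first.
unsafe_edges = {"req": ["list"], "list": ["del"]}
unsafe_labels = {"req": "user_request", "list": "read", "del": "delete_file"}
safe_edges = {"req": ["ask"], "ask": ["del"]}
safe_labels = {"req": "user_request", "ask": "confirm", "del": "delete_file"}

unsafe_result = violates(unsafe_edges, unsafe_labels, "req", "delete_file", "confirm")
safe_result = violates(safe_edges, safe_labels, "req", "delete_file", "confirm")

bandit = UCB1(["rephrase", "add_bulk_target"])
bandit.update("rephrase", 0.0)
bandit.update("add_bulk_target", 1.0)
next_arm = bandit.select()
```

Because the check is a plain graph traversal, the same trace always yields the same verdict, which is the sense in which the query is deterministic. In a full fuzzing loop the reward passed to `update` would come from how close a mutated input's trace gets to the violation pattern; how Sefz measures that proximity is not specified in the abstract.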
On 402 real-world skills from the largest public agent-skill marketplace, Sefz finds specification violations in 120 (29.9%), including 26 previously unknown exploitable guardrail violations in deployed skills. Six recurring specification pitfalls explain the bulk of the failures, suggesting concrete principles for safer skill design.