AI Summary
This study addresses the prevalent discrepancy between the declared functionality and the actual behavior of AI agent skills, a gap exacerbated by the absence of mechanisms to verify behavioral integrity. The work formally defines, for the first time, the Behavioral Integrity Verification (BIV) problem and introduces a framework that integrates static code analysis with large language model-based capability extraction. By performing typed capability matching and generating structured evidence, the framework systematically identifies deviations between skill descriptions and implementations. The research uncovers four new classes of compound threats and distinguishes between negligent and malicious intent. Evaluation on 49,943 skills from the OpenClaw registry reveals that 80.0% exhibit behavioral discrepancies. On a 906-skill benchmark, the proposed method achieves an F1 score of 0.946 in detecting malicious skills, outperforming rule-based and single-pass LLM baselines.
Abstract
Agent skills extend LLM agents with privileged third-party capabilities such as filesystem access, credentials, network calls, and shell execution. Existing safety work catches malicious prompts and risky runtime actions, but the skill artifact itself goes unverified. We formalize this as the behavioral integrity verification (BIV) problem: a typed set comparison between declared and actual capabilities over a shared taxonomy that bridges code, instructions, and metadata. The BIV framework instantiates this comparison by pairing deterministic code analysis with LLM-assisted capability extraction. The resulting structured evidence supports three downstream analyses: deviation taxonomy, root-cause classification, and malicious-skill detection. On 49,943 skills from the OpenClaw registry, the deviation taxonomy reveals a pervasive description-implementation gap: 80.0% of skills deviate from declared behavior, with four novel compound-threat categories surfaced. Root-cause classification finds that deviations are mostly oversight, not malice: 81.1% trace to developer oversight and 18.9% to adversarial intent, with 5.0% of skills carrying predicted multi-stage attack chains. On a 906-skill malicious-skill detection benchmark, BIV reaches an F1 of 0.946, outperforming state-of-the-art rule-based and single-pass LLM baselines. These results demonstrate behavioral integrity auditing for agent skills at scale.
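The typed set comparison at the core of BIV can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `CapType` entries are borrowed from the capability examples named in the abstract, while the `Capability`, `Deviation`, and `compare` names, the `action` field, and the sample skill are all hypothetical. The paper's actual taxonomy and matching rules are not given here, so exact set difference is assumed as the simplest possible comparison.

```python
from dataclasses import dataclass
from enum import Enum


class CapType(Enum):
    # Hypothetical taxonomy, mirroring the capability examples in the abstract.
    FILESYSTEM = "filesystem"
    CREDENTIALS = "credentials"
    NETWORK = "network"
    SHELL = "shell"


@dataclass(frozen=True)
class Capability:
    type: CapType
    action: str  # illustrative qualifier such as "read" or "send" (assumption)


@dataclass(frozen=True)
class Deviation:
    undeclared: frozenset  # found in the implementation, missing from the description
    unused: frozenset      # declared in the description, never found in the implementation


def compare(declared: set, actual: set) -> Deviation:
    """Compare declared and actual capability sets by plain set difference."""
    return Deviation(
        undeclared=frozenset(actual - declared),
        unused=frozenset(declared - actual),
    )


# Hypothetical skill: the description mentions only file reads,
# but analysis of the code also finds an outbound network call.
declared = {Capability(CapType.FILESYSTEM, "read")}
actual = {
    Capability(CapType.FILESYSTEM, "read"),
    Capability(CapType.NETWORK, "send"),
}

report = compare(declared, actual)
for cap in sorted(report.undeclared, key=lambda c: c.type.value):
    print(f"undeclared capability: {cap.type.value}/{cap.action}")
```

In this sketch a skill deviates whenever either difference is non-empty. Deciding whether a given deviation reflects oversight or adversarial intent is a separate downstream step that the sketch does not attempt.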