Behavioral Integrity Verification for AI Agent Skills

2026-05-12

AI Summary
This study addresses a common gap between what AI agent skills declare and what they actually do, a gap that persists because the skill artifact itself is never verified. The work formalizes the Behavioral Integrity Verification (BIV) problem and introduces a framework that pairs deterministic code analysis with large language model-based capability extraction. By matching typed capabilities and generating structured evidence, the framework identifies deviations between skill descriptions and implementations. The analysis surfaces four new compound-threat categories and separates developer oversight from adversarial intent. On 49,943 skills from the OpenClaw registry, 80.0% deviate from their declared behavior, and on a 906-skill benchmark the method reaches an F1 score of 0.946 for malicious-skill detection, outperforming rule-based and single-pass LLM baselines.
Abstract
Agent skills extend LLM agents with privileged third-party capabilities such as filesystem access, credentials, network calls, and shell execution. Existing safety work catches malicious prompts and risky runtime actions, but the skill artifact itself goes unverified. We formalize this as the behavioral integrity verification (BIV) problem: a typed set comparison between declared and actual capabilities over a shared taxonomy that bridges code, instructions, and metadata. The BIV framework instantiates this comparison by pairing deterministic code analysis with LLM-assisted capability extraction. The resulting structured evidence supports three downstream analyses: deviation taxonomy, root-cause classification, and malicious-skill detection. On 49,943 skills from the OpenClaw registry, the deviation taxonomy reveals a pervasive description-implementation gap: 80.0% of skills deviate from declared behavior, with four novel compound-threat categories surfaced. Root-cause classification finds that deviations are mostly oversight, not malice: 81.1% trace to developer oversight and 18.9% to adversarial intent, with 5.0% of skills carrying predicted multi-stage attack chains. On a 906-skill malicious-skill detection benchmark, BIV reaches an F1 of 0.946, outperforming state-of-the-art rule-based and single-pass LLM baselines. These results demonstrate behavioral integrity auditing for agent skills at scale.
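The typed set comparison described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `CapabilityType` entries, the `Capability` fields, the `compare` function, and the report keys are all hypothetical names chosen for illustration, and the extraction step (code analysis plus LLM-assisted extraction) is assumed to have already produced the two sets.

```python
from dataclasses import dataclass
from enum import Enum


class CapabilityType(Enum):
    # Hypothetical taxonomy, loosely based on the capabilities the abstract lists.
    FILESYSTEM = "filesystem"
    CREDENTIALS = "credentials"
    NETWORK = "network"
    SHELL = "shell"


@dataclass(frozen=True)
class Capability:
    kind: CapabilityType
    target: str  # illustrative only: a path, host, or command


def compare(declared: set[Capability], actual: set[Capability]) -> dict[str, set[Capability]]:
    """Compare declared and actual capabilities as plain sets (assumed scheme)."""
    return {
        "undeclared": actual - declared,  # present in the implementation, missing from the description
        "unused": declared - actual,      # described but not found in the implementation
        "matched": declared & actual,     # described and implemented
    }


# Toy skill: it declares local file access but also contacts a remote host.
declared = {Capability(CapabilityType.FILESYSTEM, "./notes")}
actual = {
    Capability(CapabilityType.FILESYSTEM, "./notes"),
    Capability(CapabilityType.NETWORK, "example.com"),
}
report = compare(declared, actual)
```

In this toy case the network capability lands in `report["undeclared"]`, which is the kind of description-implementation deviation the abstract refers to. Exact equality on `(kind, target)` is a simplifying assumption; a real matcher would likely need looser matching on targets.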
Problem

Research questions and friction points this paper is trying to address.

Behavioral Integrity
AI Agent Skills
Capability Verification
Security Auditing
Skill Artifact
Innovation

Methods, ideas, or system contributions that make the work stand out.

Behavioral Integrity Verification
AI Agent Skills
Capability Consistency
LLM-assisted Code Analysis
Malicious Skill Detection
Authors
Yuhao Wu
Palo Alto Networks, Washington University in St. Louis
Computer Security, Privacy
Tung-Ling Li
Palo Alto Networks
Hongliang Liu
Palo Alto Networks