Do Skill Descriptions Tell the Truth? Detecting Undisclosed Security Behaviors in Code-Backed LLM Skills

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

career value

183K/year

🤖 AI Summary

Natural language descriptions of large language model (LLM) skills often omit security-relevant behaviors present in their implementations, leading users to misjudge their capabilities. To address this issue, this work proposes the first security attribute taxonomy tailored to LLM skills and constructs a source-code-level Security Property Graph (SPG) that preserves fine-grained code evidence. By integrating static program analysis with LLM-assisted reasoning, the approach detects inconsistencies between skill descriptions and their actual implementations. Evaluation on 4,556 skills demonstrates that the method achieves a precision of 84.8% and a recall of 96.5%, revealing that 9.4% of the skills exhibit undisclosed security-related behaviors and 24.3% of the descriptions are overly vague.

📝 Abstract

Programmatic skills in LLM ecosystems consist of a natural-language description and executable implementation files. Users and LLMs rely on the description to understand the skill's scope. However, the implementation may perform security-relevant operations, such as credential access, network communication, or command execution, that the description does not state. We study this description--implementation inconsistency by asking whether the implementation stays within the security-relevant scope declared in the description. We manually analyze 920 real-world programmatic skills and construct an 11-category security property taxonomy. Based on this taxonomy, we build SKILLSCOPE, which constructs source-level security property graphs (SPGs) from implementations and performs LLM-assisted consistency checking. SPG nodes retain source-level code patterns rather than abstract taxonomy labels, preserving fine-grained evidence for checking. On 4,556 programmatic skills with double-blind human review, SKILLSCOPE achieves a precision of 84.8\% and a recall of 96.5\% for identifying inconsistency. Confirmed inconsistency affects 9.4\% of skills, while cases of coarser description, in which implementation details remain within the declared scope, account for 24.3\%. Ablation experiments confirm that both the SPG and the taxonomy contribute: removing the taxonomy reduces precision from 87.8\% to 72.3\%, while removing the SPG reduces recall from 94.7\% to 79.0\%.

Problem

Research questions and friction points this paper is trying to address.

skill description

security behavior

description-implementation inconsistency

LLM skills

undisclosed functionality

Innovation

Methods, ideas, or system contributions that make the work stand out.

security property graph

description-implementation inconsistency

LLM skills