AgentTrap: Measuring Runtime Trust Failures in Third-Party Agent Skills

📅 2026-05-13

🤖 AI Summary

Third-party agent skills can embed malicious behavior in high-privilege, low-supervision workflows, causing runtime trust failures. This work presents the first systematic characterization and evaluation of these risks in the agent-skill supply chain. It introduces AgentTrap, a dynamic benchmark of 141 tasks (91 malicious and 50 benign utility tasks) that combines sandboxed execution, trajectory analysis, a 16-category security-impact taxonomy, and dynamic trigger mechanisms. Experiments show that mainstream large language models often complete the explicit user task while overlooking the implicit malicious side effects a skill introduces, which motivates runtime security assessment of the full operational environment rather than of the model alone.

📝 Abstract

Third-party skills are becoming the package ecosystem for LLM agents. They package natural-language instructions, helper scripts, templates, documents, and service configuration into reusable workflows. This makes skills useful, but it also introduces a new security problem: a malicious skill does not need to ask the model to perform an obviously harmful action. Instead, it can disguise the harmful behavior as part of a routine workflow, relying on the agent to execute that workflow with high-value permissions and limited human supervision. We introduce AgentTrap, a dynamic benchmark for evaluating whether LLM agents can use third-party skills while resisting malicious runtime behavior. AgentTrap contains 141 tasks: 91 malicious tasks and 50 benign utility tasks, covering 16 security-impact dimensions grounded in agent-skill supply-chain threats. In each task, the agent receives an ordinary user request, runs with installed skills that may contain malicious workflow elements, and is executed in a sandboxed environment. AgentTrap then judges complete trajectories for attack success, blocked or refused behavior, attack-not-triggered cases, and no-attack-evidence outcomes. Our central finding is that the most informative failures are not simple jailbreaks. Models often complete the visible user task while treating unsafe side effects introduced by the skill as part of the normal workflow. This motivates runtime evaluation of the concrete model-framework-workspace environment in which users actually delegate work. Code and data are available at https://github.com/zhmzm/AgentTrap and https://huggingface.co/datasets/zhmzm/AgentTrap.
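
To make the four trajectory verdicts named in the abstract concrete, here is a minimal Python sketch of how judged trajectories might be tallied. It is an illustration under stated assumptions, not the AgentTrap implementation: the names `Outcome`, `JudgedTrajectory`, and `summarize`, the record fields, and the toy records are all hypothetical, and the actual schema in the released code and data may differ.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    # The four verdict labels listed in the abstract (identifier spellings are assumed).
    ATTACK_SUCCESS = "attack_success"
    BLOCKED_OR_REFUSED = "blocked_or_refused"
    ATTACK_NOT_TRIGGERED = "attack_not_triggered"
    NO_ATTACK_EVIDENCE = "no_attack_evidence"


@dataclass
class JudgedTrajectory:
    # Hypothetical record for one sandboxed run after judging.
    task_id: str
    malicious: bool            # True for a malicious task, False for a benign utility task
    user_task_completed: bool  # did the agent finish the visible user request?
    outcome: Outcome


def summarize(results):
    """Tally verdicts over malicious tasks only."""
    malicious = [r for r in results if r.malicious]
    counts = Counter(r.outcome for r in malicious)
    n = len(malicious)
    # The failure mode the abstract highlights: the visible task succeeds
    # and the skill's unsafe side effect also goes through.
    completed_with_side_effect = sum(
        1 for r in malicious
        if r.user_task_completed and r.outcome is Outcome.ATTACK_SUCCESS
    )
    return {
        "malicious_tasks": n,
        "outcome_counts": {o.value: counts[o] for o in Outcome},
        "attack_success_rate": counts[Outcome.ATTACK_SUCCESS] / n if n else 0.0,
        "completed_with_side_effect": completed_with_side_effect,
    }


# Toy records, invented purely for illustration (not benchmark data).
toy_results = [
    JudgedTrajectory("m-001", True, True, Outcome.ATTACK_SUCCESS),
    JudgedTrajectory("m-002", True, True, Outcome.BLOCKED_OR_REFUSED),
    JudgedTrajectory("m-003", True, False, Outcome.ATTACK_NOT_TRIGGERED),
    JudgedTrajectory("b-001", False, True, Outcome.NO_ATTACK_EVIDENCE),
]
summary = summarize(toy_results)
```

Tracking `user_task_completed` alongside the verdict is a design choice of this sketch: it separates a run that fails outright from one that looks successful to the user while the harmful side effect executes, which is the case the abstract calls most informative.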

Problem

Research questions and friction points this paper is trying to address.

trust failure

third-party skills

LLM agents

runtime security

malicious workflows

Innovation

Methods, ideas, or system contributions that make the work stand out.

AgentTrap

LLM agents

third-party skills