🤖 AI Summary
This work addresses underexplored security risks in skill-based agent systems, particularly the vulnerability of skill implementations to backdoor attacks. The authors propose SkillTrojan, the first attack framework targeting the skill implementation layer, which stealthily embeds malicious logic into standard skill composition pipelines through encrypted payload fragmentation, trigger-condition binding, and automated skill template synthesis. The embedded payload activates only under specific trigger conditions, enabling covert exploitation while preserving normal functionality. SkillTrojan supports scalable propagation across heterogeneous skill ecosystems, exposing a critical gap in existing architectures: the absence of security validation for composed skills. Experiments on EHR SQL tasks with GPT-5.2-1211-Global demonstrate a 97.2% attack success rate, while the compromised agent retains 89.3% accuracy on benign tasks. The authors also release a dataset of over 3,000 backdoored skills to facilitate further research.
📝 Abstract
Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a largely unexamined security attack surface. We propose SkillTrojan, a backdoor attack that targets skill implementations rather than model parameters or training data. SkillTrojan embeds malicious logic inside otherwise plausible skills and leverages standard skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple benign-looking skill invocations and activates it only under a predefined trigger. SkillTrojan also supports automated synthesis of backdoored skills from arbitrary skill templates, enabling scalable propagation across skill-based agent ecosystems. To enable systematic evaluation, we release a dataset of 3,000+ curated backdoored skills spanning diverse skill patterns and trigger-payload configurations. We instantiate SkillTrojan in a representative code-based agent setting and evaluate both clean-task utility and attack success rate (ASR). Our results show that skill-level backdoors can be highly effective with minimal degradation of benign behavior, exposing a critical blind spot in current skill-based agent architectures and motivating defenses that explicitly reason about skill composition and execution. Concretely, on EHR SQL, SkillTrojan attains up to 97.2% ASR while maintaining 89.3% clean-task accuracy on GPT-5.2-1211-Global.