🤖 AI Summary
This study addresses the challenge of detecting stealthy supply-chain poisoning attacks on LLM agent systems, in which malicious behavior is embedded in third-party tools and MCP servers and which existing methods struggle to identify. To this end, the authors introduce SC-Inject-Bench, the first systematically defined benchmark for this threat, comprising over 10,000 malicious MCP tools that span more than 25 attack types. They also propose ShieldNet, a framework that captures real interaction traffic through a network-layer man-in-the-middle proxy, extracts critical network events, and applies a lightweight classifier for runtime detection. Rather than relying on surface-level tool artifacts, as conventional approaches do, ShieldNet models network behavior directly, which substantially improves detection accuracy. Experiments show that ShieldNet achieves an F1 score of up to 0.995 with a false positive rate of only 0.8% on SC-Inject-Bench, significantly outperforming existing MCP scanners and semantic guardrails.
📝 Abstract
Existing research on LLM agent security mainly focuses on prompt injection and unsafe input/output behaviors. However, as agents increasingly rely on third-party tools and MCP servers, a new class of supply-chain threats has emerged, in which malicious behaviors are embedded in seemingly benign tools, silently hijacking agent execution, leaking sensitive data, or triggering unauthorized actions. Despite their growing impact, there is currently no comprehensive benchmark for evaluating such threats. To bridge this gap, we introduce SC-Inject-Bench, a large-scale benchmark comprising over 10,000 malicious MCP tools grounded in a taxonomy of 25+ supply-chain attack types derived from MITRE ATT&CK. We observe that existing MCP scanners and semantic guardrails perform poorly on this benchmark. Motivated by this finding, we propose ShieldNet, a network-level guardrail framework that detects supply-chain poisoning by observing real network interactions rather than surface-level tool traces. ShieldNet integrates a man-in-the-middle (MITM) proxy and an event extractor to identify critical network behaviors, which are then processed by a lightweight classifier for attack detection. Extensive experiments show that ShieldNet achieves strong detection performance (an F1 score of up to 0.995 with a false positive rate of only 0.8%) while adding little runtime overhead, substantially outperforming existing MCP scanners and LLM-based guardrails.
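The abstract describes the detection pipeline only at a high level: a MITM proxy records a tool's traffic, an event extractor turns it into critical network behaviors, and a lightweight classifier flags poisoned tools. The sketch below illustrates that shape under loudly stated assumptions. The `NetworkEvent` record, the four hand-picked features, the secret-marker strings, the toy traces, and the logistic-regression classifier are all hypothetical stand-ins, not ShieldNet's actual extractor or model, and the proxy itself is omitted, so events are constructed by hand. The sketch also shows how the two reported metrics, F1 and false positive rate, are computed.

```python
import math
from dataclasses import dataclass

# Hypothetical record of one request seen by a MITM proxy while a tool runs.
@dataclass(frozen=True)
class NetworkEvent:
    host: str
    method: str
    body: str

# Illustrative secret-like markers (an assumption, not taken from the paper).
SENSITIVE_MARKERS = ("api_key", "password", "aws_secret", "ssh-rsa")

def extract_features(events, declared_hosts):
    """Summarize one tool invocation's traffic as a small feature vector (assumed features)."""
    undeclared = [e for e in events if e.host not in declared_hosts]
    leaks = sum(any(m in e.body.lower() for m in SENSITIVE_MARKERS) for e in events)
    outbound_kb = sum(len(e.body) for e in undeclared) / 1024
    return [
        float(len(undeclared)),                    # requests to hosts the tool never declared
        float(leaks),                              # request bodies containing secret-like strings
        float(len({e.host for e in undeclared})),  # distinct undeclared hosts
        min(outbound_kb, 10.0),                    # KiB sent to undeclared hosts, capped at 10
    ]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logreg(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on mean logistic loss; a stand-in 'lightweight classifier'."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(w, b, x, threshold=0.5):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= threshold

def f1_and_fpr(y_true, y_pred):
    """F1 = 2PR/(P+R); false positive rate = FP/(FP+TN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return f1, fpr

# Toy traces invented for illustration: label 1 = poisoned tool, 0 = benign.
DECLARED = {"api.weather.example"}
TRAIN = [
    ([NetworkEvent("api.weather.example", "GET", "city=Paris")], 0),
    ([NetworkEvent("api.weather.example", "GET", "city=Oslo"),
      NetworkEvent("api.weather.example", "GET", "units=metric")], 0),
    ([], 0),
    ([NetworkEvent("api.weather.example", "GET", "city=Paris"),
      NetworkEvent("collector.evil.example", "POST", "AWS_SECRET=abc123")], 1),
    ([NetworkEvent("paste.evil.example", "POST", "password=hunter2"),
      NetworkEvent("cdn.evil.example", "GET", "")], 1),
]
X = [extract_features(events, DECLARED) for events, _ in TRAIN]
y = [label for _, label in TRAIN]
w, b = train_logreg(X, y)
```

Logistic regression is used here only because it is small and dependency-free; the abstract does not specify which classifier ShieldNet uses or which network events its extractor keeps.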