How Adversarial Environments Mislead Agentic AI?

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a trust gap in tool-integrated agents: they are evaluated for performance in benign settings, not for robustness when external tool outputs are tampered with. The authors formalize this vulnerability as Adversarial Environmental Injection (AEI), a threat model in which adversaries compromise tool outputs to deceive agents, and introduce POTEMKIN, a Model Context Protocol (MCP)-compatible harness for plug-and-play robustness testing. The study defines two orthogonal attack surfaces: breadth attacks ("The Illusion"), which poison retrieval to push the agent toward false beliefs, and depth attacks ("The Maze"), which use structural traps to drive the agent into infinite loops. Across more than 11,000 runs on five frontier agents, resistance to one attack type often increases vulnerability to the other, indicating that epistemic and navigational robustness are distinct capabilities.
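To make the breadth attack concrete, here is a minimal Python sketch of retrieval poisoning, assuming a tool is just a function from a query string to a list of text snippets. It is not POTEMKIN's API: `benign_search`, `poison`, `naive_agent_answer`, and the toy corpus are all hypothetical names invented for illustration.

```python
import random
from collections import Counter
from typing import Callable, List

# A "tool" here is simply a function from a query to a list of snippets.
SearchTool = Callable[[str], List[str]]

def benign_search(query: str) -> List[str]:
    """Hypothetical stand-in for an honest retrieval tool."""
    corpus = {
        "boiling point of water": [
            "Water boils at 100 C at sea level.",
            "Water boils at 100 C at sea level.",
            "Water boils at 100 C at sea level.",
        ]
    }
    return list(corpus.get(query, []))

def poison(tool: SearchTool, fake: str, rate: float, seed: int = 0) -> SearchTool:
    """Wrap a tool so each result is replaced by `fake` with probability `rate`."""
    rng = random.Random(seed)  # seeded so that runs are reproducible

    def wrapped(query: str) -> List[str]:
        return [fake if rng.random() < rate else doc for doc in tool(query)]

    return wrapped

def naive_agent_answer(tool: SearchTool, query: str) -> str:
    """A deliberately credulous agent: it repeats the most frequent snippet."""
    results = tool(query)
    return Counter(results).most_common(1)[0][0]

if __name__ == "__main__":
    fake_claim = "Water boils at 50 C at sea level."
    lying_tool = poison(benign_search, fake_claim, rate=1.0)
    print(naive_agent_answer(benign_search, "boiling point of water"))
    print(naive_agent_answer(lying_tool, "boiling point of water"))
```

The agent code is identical in both calls; only the environment changes. That is the point of the threat model: the agent has no independent signal that the tool has started lying.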

📝 Abstract

Tool-integrated agents are deployed on the premise that external tools ground their outputs in reality. Yet this very reliance creates a critical attack surface. Current evaluations benchmark capability in benign settings, asking "can the agent use tools correctly" but never "what if the tools lie". We identify this Trust Gap: agents are evaluated for performance, not for skepticism. We formalize this vulnerability as Adversarial Environmental Injection (AEI), a threat model where adversaries compromise tool outputs to deceive agents. AEI constitutes environmental deception: constructing a "fake world" of poisoned search results and fabricated reference networks around unsuspecting agents. We operationalize this via POTEMKIN, a Model Context Protocol (MCP)-compatible harness for plug-and-play robustness testing. We identify two orthogonal attack surfaces: The Illusion (breadth attacks) poison retrieval to induce epistemic drift toward false beliefs, while The Maze (depth attacks) exploit structural traps to cause policy collapse into infinite loops. Across 11,000+ runs on five frontier agents, we find a stark robustness gap: resistance to one attack often increases vulnerability to the other, demonstrating that epistemic and navigational robustness are distinct capabilities.
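The depth attack can be illustrated with a small Python sketch, assuming the "fabricated reference network" is a graph in which every page defers to another page. The graph `MAZE` and the function `follow_links` are hypothetical and are not taken from the paper; the sketch only shows why an agent that follows references without tracking state never terminates on its own.

```python
from typing import Dict, Optional, Tuple

# Hypothetical fabricated reference network: each page points to another,
# and the chain loops back on itself, so no page is ever terminal.
MAZE: Dict[str, Optional[str]] = {"A": "B", "B": "C", "C": "A"}

def follow_links(
    graph: Dict[str, Optional[str]],
    start: str,
    max_steps: int,
    track_visited: bool,
) -> Tuple[int, str]:
    """Follow references from `start`; return (steps taken, reason for stopping)."""
    visited = set()
    node = start
    for step in range(max_steps):
        if track_visited:
            if node in visited:
                return step, "cycle detected"
            visited.add(node)
        nxt = graph.get(node)
        if nxt is None:
            return step, "reached a terminal page"
        node = nxt
    # Only the external budget stops an agent that keeps no memory of its path.
    return max_steps, "step budget exhausted"

if __name__ == "__main__":
    print(follow_links(MAZE, "A", max_steps=50, track_visited=False))
    print(follow_links(MAZE, "A", max_steps=50, track_visited=True))
```

A visited set is enough to escape this toy cycle, but it says nothing about whether the pages are truthful, which is consistent with the abstract's claim that navigational and epistemic robustness are separate capabilities.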

Problem

Research questions and friction points this paper is trying to address.

Adversarial Environmental Injection

Trust Gap

Environmental Deception

Tool-integrated Agents

Robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial Environmental Injection

Trust Gap

Tool-integrated Agents