π€ AI Summary
This work addresses the vulnerability of AI agents to indirect prompt injection (IPI) attacks during tool use, where existing defenses often degrade utility or introduce latency through overzealous sanitization. The authors propose CausalArmor, a lightweight defense framework that leverages leave-one-out causal attribution to selectively sanitize inputs only when the influence of untrusted content on privileged operations exceeds that of the userβs intended input. CausalArmor further integrates a retrospective chain-of-thought masking mechanism to disrupt poisoned reasoning paths. The method provides theoretical guarantees: attribution-based sanitization exponentially reduces the probability of malicious behavior. Evaluated on AgentDojo and DoomArena benchmarks, CausalArmor achieves high security without compromising agent utility or responsiveness, while significantly enhancing interpretability.
π Abstract
AI agents equipped with tool-calling capabilities are susceptible to Indirect Prompt Injection (IPI) attacks. In this attack scenario, malicious commands hidden within untrusted content trick the agent into performing unauthorized actions. Existing defenses can reduce attack success but often suffer from the over-defense dilemma: they deploy expensive, always-on sanitization regardless of actual threat, thereby degrading utility and latency even in benign scenarios. We revisit IPI through a causal ablation perspective: a successful injection manifests as a dominance shift where the user request no longer provides decisive support for the agent's privileged action, while a particular untrusted segment, such as a retrieved document or tool output, provides disproportionate attributable influence. Based on this signature, we propose CausalArmor, a selective defense framework that (i) computes lightweight, leave-one-out ablation-based attributions at privileged decision points, and (ii) triggers targeted sanitization only when an untrusted segment dominates the user intent. Additionally, CausalArmor employs retroactive Chain-of-Thought masking to prevent the agent from acting on ``poisoned''reasoning traces. We present a theoretical analysis showing that sanitization based on attribution margins conditionally yields an exponentially small upper bound on the probability of selecting malicious actions. Experiments on AgentDojo and DoomArena demonstrate that CausalArmor matches the security of aggressive defenses while improving explainability and preserving utility and latency of AI agents.