🤖 AI Summary
This work addresses the vulnerability of AI agents to indirect prompt injection attacks, in which malicious instructions embedded in untrusted inputs can induce hazardous agent behaviors. To mitigate this risk, the authors propose a system-level defense framework that integrates dynamic re-planning, constrained perception and decision-making, and human-in-the-loop intervention into a secure and controllable agent architecture. The framework combines rule-based and model-based safety checks to dynamically update security policies and enforce strict behavioral boundaries on the underlying model. The study also highlights critical limitations of existing evaluation benchmarks and advocates more robust, real-world-oriented assessment of agent resilience and human-AI interaction safety.
📝 Abstract
AI agents, predominantly powered by large language models (LLMs), are vulnerable to indirect prompt injection, in which malicious instructions embedded in untrusted data can trigger dangerous agent actions. This position paper discusses our vision for system-level defenses against indirect prompt injection attacks. We articulate three positions: (1) dynamic replanning and security policy updates are often necessary for dynamic tasks and realistic environments; (2) certain context-dependent security decisions would still require LLMs (or other learned models), but should only be made within system designs that strictly constrain what the model can observe and decide; (3) in inherently ambiguous cases, personalization and human interaction should be treated as core design considerations. In addition to our main positions, we discuss limitations of existing benchmarks that can create a false sense of utility and security. We also highlight the value of system-level defenses, which serve as the skeleton of agentic systems by structuring and controlling agent behaviors, integrating rule-based and model-based security checks, and enabling more targeted research on model robustness and human interaction.
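To make the abstract's architecture concrete, here is a minimal, hypothetical sketch of a system-level gate combining the three positions: a deterministic rule check, a model-based check that sees only a redacted, minimal view of the action and answers a binary question (constrained perception and decision), and escalation to a human for ambiguous cases. All names (`BLOCKED_ACTIONS`, `gate`, the placeholder heuristic standing in for an LLM judge) are illustrative assumptions, not the paper's actual design.

```python
import re

# Illustrative hard-deny list (assumption, not from the paper).
BLOCKED_ACTIONS = {"delete_file", "send_payment"}

def rule_check(action: str, args: dict) -> bool:
    """Deterministic policy: deny clearly dangerous actions outright."""
    if action in BLOCKED_ACTIONS:
        return False
    # Illustrative rule: reject shell-metacharacter patterns in arguments.
    return not any(re.search(r"[;&|`$]", str(v)) for v in args.values())

def redact(args: dict) -> dict:
    """Constrained perception: expose only truncated argument values,
    never the agent's full context, to the downstream checker."""
    return {k: str(v)[:200] for k, v in args.items()}

def model_check(action: str, redacted_args: dict) -> bool:
    """Stand-in for an LLM judge. A real system would query a model here,
    but only with the redacted view and only for an allow/deny verdict
    (constrained decision). A substring heuristic keeps the sketch runnable."""
    return "ignore previous instructions" not in str(redacted_args).lower()

def gate(action: str, args: dict) -> str:
    """System-level skeleton: rules first, then the constrained model check;
    ambiguous cases escalate to the human in the loop."""
    if not rule_check(action, args):
        return "deny"
    if not model_check(action, redact(args)):
        return "escalate"  # hand off to human-in-the-loop review
    return "allow"
```

For example, `gate("read_file", {"path": "notes.txt"})` returns `"allow"`, while an action whose arguments carry an injected instruction is escalated rather than silently executed, keeping the final decision with the human.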