🤖 AI Summary
Current approaches to AI agent security rely heavily on model robustness and lack system-level guarantees. This position paper argues for treating the AI model as an untrusted component and enforcing security invariants at the system level, drawing on established principles from operating systems, networking, formal methods, and adversarial machine learning. Rather than replacing robustness work, these systems security techniques complement it, with the goal of agentic systems that have predictable guarantees. The authors analyze eleven representative real-world attacks on agents, discuss how the principles, if realized, could have prevented them, and identify the research challenges that stand in the way of implementing the principles in agents.
📝 Abstract
We take the position that agent security must be approached as a systems problem: the AI model powering the agent must be treated as an untrusted component, and security invariants must be enforced at the system level. Through this lens, efforts to increase model robustness (the dominant viewpoint in the community) are insufficient on their own. Instead, we must complement existing efforts with techniques from the systems security domain. Based on our experience as cybersecurity researchers in operating systems, networks, formal methods, and adversarial machine learning, we articulate a set of core principles, grounded in decades of systems security research, that provide a foundation for designing agentic systems with predictable guarantees. As evidence, we analyze eleven representative real-world attacks on agents and discuss how systems principles, if realized, could have prevented these attacks. We also identify the research challenges that stand in the way of implementing these principles in agents.