AI Summary
This work addresses security vulnerabilities in LLM-driven AI agents during multi-turn planning and tool invocation, particularly the risk that adversarial jailbreaking leads to unsafe actions (e.g., sending email outside the company domain, over-extending a robotic arm). To this end, we propose the first unified framework for assessing agent controllability and risk exposure, introducing a novel multi-turn attack suite inspired by the OWASP Top 10 and covering ten autonomous agent scenarios. We systematically evaluate how faithfully 13 open-source, tool-augmented LLMs adhere to system-level safety instructions across 37 simulated tool environments. Experimental results reveal substantial inconsistencies in policy-enforcement fidelity across models, exposing widespread fragility in current safety-boundary enforcement. Our study establishes a standardized, empirically grounded evaluation benchmark for building verifiable and auditable secure agent systems.
Abstract
Securing AI agents powered by Large Language Models (LLMs) represents one of the most critical challenges in AI security today. Unlike traditional software, AI agents leverage LLMs as their "brain" to autonomously perform actions via connected tools. This capability introduces risks that go far beyond the harmful text a chatbot can produce, which was until recently the main application of LLMs. A compromised AI agent can deliberately abuse powerful tools to perform malicious, often irreversible actions, limited solely by the guardrails on the tools themselves and the LLM's ability to enforce them. This paper presents ASTRA, a first-of-its-kind framework designed to evaluate how effectively LLMs support the creation of secure agents that enforce custom guardrails defined at the system-prompt level (e.g., "Do not send an email outside the company domain," or "Never extend the robotic arm more than 2 meters").
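To make the notion of a system-prompt-level guardrail concrete, here is a minimal, hypothetical sketch (not ASTRA's actual harness; all names are illustrative) of how a deployed agent could back such a rule with a programmatic check on proposed tool calls. The paper itself evaluates whether the LLM honors the rule on its own; a hard check like this is the complementary, tool-side guardrail.

```python
# Hypothetical sketch: validate a proposed tool call against the guardrail
# "Do not send an email outside the company domain". The tool name
# "send_email", the "to" argument, and the domain are assumed placeholders.

COMPANY_DOMAIN = "example.com"

def violates_email_guardrail(tool_name: str, args: dict) -> bool:
    """Return True if a proposed tool call breaks the email guardrail."""
    if tool_name != "send_email":
        return False  # guardrail only constrains the email tool
    recipient = args.get("to", "")
    return not recipient.endswith("@" + COMPANY_DOMAIN)

# A compliant call passes; a cross-domain send is flagged.
assert not violates_email_guardrail("send_email", {"to": "bob@example.com"})
assert violates_email_guardrail("send_email", {"to": "eve@attacker.net"})
```

In an agent loop, such a check would run between the LLM's planned tool call and its execution, rejecting or escalating any call that violates the policy regardless of whether the model was jailbroken.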
Our holistic framework simulates 10 diverse autonomous agents, ranging from a coding assistant to a delivery drone, equipped with 37 unique tools. We test these agents against a suite of novel attacks developed specifically for agentic threats, inspired by the OWASP Top 10 but adapted to challenge the LLM's ability to enforce policies during multi-turn planning and strict tool invocation. By evaluating 13 open-source, tool-calling LLMs, we uncover surprising and significant differences in their ability to remain secure and keep operating within their boundaries. The purpose of this work is to provide the community with a robust, unified methodology to build and validate better LLMs, ultimately pushing toward more secure and reliable agentic AI systems.