LlamaFirewall: An open source guardrail system for building secure AI agents

📅 2025-05-06

📈 Citations: 0

✨ Influential: 0

career value

225K/year

🤖 AI Summary

Large language models (LLMs) are evolving into autonomous AI agents that process untrusted inputs—such as web pages and emails—introducing critical security risks including prompt injection, objective misalignment, and unsafe code generation, which existing defenses struggle to mitigate. This paper proposes the first real-time safety guard framework specifically designed for AI agents, featuring three novel dynamic protection mechanisms: (1) PromptGuard 2, a state-of-the-art jailbreak detection module; (2) a Chain-of-Thought Alignment Auditor to strengthen defense against indirect prompt injection; and (3) CodeShield, a lightweight, scalable online static analysis engine for unsafe code interception. The framework supports customizable policies via both regex- and LLM-based prompting. Extensive experiments demonstrate SOTA performance across prompt injection detection, objective alignment verification, and unsafe code blocking. All components are open-sourced as a modular, plug-and-play system.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) have evolved from simple chatbots into autonomous agents capable of performing complex tasks such as editing production code, orchestrating workflows, and taking higher-stakes actions based on untrusted inputs like webpages and emails. These capabilities introduce new security risks that existing security measures, such as model fine-tuning or chatbot-focused guardrails, do not fully address. Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor to serve as a final layer of defense, and support system level, use case specific safety policy definition and enforcement. We introduce LlamaFirewall, an open-source security focused guardrail framework designed to serve as a final layer of defense against security risks associated with AI Agents. Our framework mitigates risks such as prompt injection, agent misalignment, and insecure code risks through three powerful guardrails: PromptGuard 2, a universal jailbreak detector that demonstrates clear state of the art performance; Agent Alignment Checks, a chain-of-thought auditor that inspects agent reasoning for prompt injection and goal misalignment, which, while still experimental, shows stronger efficacy at preventing indirect injections in general scenarios than previously proposed approaches; and CodeShield, an online static analysis engine that is both fast and extensible, aimed at preventing the generation of insecure or dangerous code by coding agents. Additionally, we include easy-to-use customizable scanners that make it possible for any developer who can write a regular expression or an LLM prompt to quickly update an agent's security guardrails.

Problem

Research questions and friction points this paper is trying to address.

Address security risks in autonomous AI agents

Mitigate prompt injection and agent misalignment

Prevent insecure code generation by AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source guardrail framework for AI security

Universal jailbreak detector with state-of-the-art performance

Online static analysis engine for secure code generation

🔎 Similar Papers

The Emerged Security and Privacy of LLM Agent: A Survey with Case Studies