MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of large language model (LLM)-driven web agents to indirect prompt injection attacks embedded in web pages, which can cause agent behavior to deviate from user intent. To counter this, the authors propose MUZZLE, a novel framework that introduces the first adaptive attack generation mechanism based on agent execution trajectory feedback. MUZZLE identifies high-sensitivity injection points by analyzing execution traces, dynamically crafts context-aware malicious instructions, and iteratively refines its attack strategy. This approach overcomes the limitations of traditional methods that rely on fixed templates or manual selection of injection points, enabling cross-application attacks and the discovery of tailored phishing scenarios. Evaluated across four web applications, MUZZLE automatically uncovered 37 new attack variants spanning ten adversarial objectives, including two cross-application injections and one agent-customized phishing attack, significantly advancing automated red-teaming capabilities.
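The adaptive loop described above (rank injection surfaces from an observed trajectory, inject, then refine on failure feedback) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `Trajectory` fields, `run_agent` interface, and refinement heuristic are all assumptions for exposition.

```python
# Hypothetical sketch of a MUZZLE-style adaptive red-teaming loop.
# All names and the agent interface are illustrative assumptions,
# not the paper's actual API.
from dataclasses import dataclass


@dataclass
class Trajectory:
    visited_fields: list   # page surfaces the agent read or acted on
    success: bool          # did the injected instruction hijack the agent?
    feedback: str = ""     # observation from the (failed) execution


def rank_injection_points(trajectory):
    """Score injection surfaces by how often the agent attended to them."""
    counts = {}
    for field in trajectory.visited_fields:
        counts[field] = counts.get(field, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)


def refine_attack(prompt, feedback):
    """Toy refinement: fold feedback from the failed run into the prompt."""
    return prompt + f" [refined: {feedback}]"


def adaptive_red_team(run_agent, seed_prompt, max_iters=5):
    """Iteratively inject, observe the trajectory, refine until success."""
    probe = run_agent(None)  # benign run to collect an execution trace
    prompt = seed_prompt
    for surface in rank_injection_points(probe):
        for _ in range(max_iters):
            traj = run_agent((surface, prompt))
            if traj.success:
                return surface, prompt
            prompt = refine_attack(prompt, traj.feedback)
    return None
```

In the actual framework, `refine_attack` and the surface-ranking step would be LLM-driven and context-aware; the skeleton only shows the feedback loop structure.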

📝 Abstract
Large language model (LLM) based web agents are increasingly deployed to automate complex online tasks by directly interacting with websites and performing actions on users' behalf. While these agents offer powerful capabilities, their design exposes them to indirect prompt injection attacks embedded in untrusted web content, enabling adversaries to hijack agent behavior and violate user intent. Despite growing awareness of this threat, existing evaluations rely on fixed attack templates, manually selected injection surfaces, or narrowly scoped scenarios, limiting their ability to capture realistic, adaptive attacks encountered in practice. We present MUZZLE, an automated agentic framework for evaluating the security of web agents against indirect prompt injection attacks. MUZZLE utilizes the agent's trajectories to automatically identify high-salience injection surfaces and adaptively generate context-aware malicious instructions that target violations of confidentiality, integrity, and availability. Unlike prior approaches, MUZZLE adapts its attack strategy based on the agent's observed execution trajectory and iteratively refines attacks using feedback from failed executions. We evaluate MUZZLE across diverse web applications, user tasks, and agent configurations, demonstrating its ability to automatically and adaptively assess the security of web agents with minimal human intervention. Our results show that MUZZLE effectively discovers 37 new attacks on 4 web applications with 10 adversarial objectives that violate confidentiality, availability, or privacy properties. MUZZLE also identifies novel attack strategies, including 2 cross-application prompt injection attacks and an agent-tailored phishing scenario.
Problem

Research questions and friction points this paper is trying to address.

indirect prompt injection
web agents
LLM security
adaptive attacks
red-teaming
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive red-teaming
indirect prompt injection
agentic security evaluation
context-aware attacks
LLM-based web agents