🤖 AI Summary
This paper addresses the lack of systematic benchmarks for detecting prompt injection attacks in web agent scenarios by introducing the first multimodal (text + image) prompt injection detection benchmark tailored to web agents. Methodologically, it proposes a fine-grained attack taxonomy grounded in a principled threat model and generates adversarial samples covering both implicit instructions and imperceptible perturbations; it then comprehensively evaluates a diverse set of detection algorithms. Key contributions include: (1) releasing the first annotated multimodal dataset spanning diverse attack types (explicit/implicit, perceptible/imperceptible); (2) empirically demonstrating that existing detectors degrade significantly under implicit-instruction and imperceptible-perturbation settings; and (3) open-sourcing all data, annotations, and code to advance research on robust and secure web agents.
📝 Abstract
Multiple prompt injection attacks have been proposed against web agents. At the same time, various methods have been developed to detect general prompt injection attacks, but none have been systematically evaluated on web agents. In this work, we bridge this gap by presenting the first comprehensive benchmark study on detecting prompt injection attacks targeting web agents. We begin by introducing a fine-grained categorization of such attacks based on the threat model. We then construct datasets containing both malicious and benign samples: malicious text segments generated by different attacks, benign text segments from four categories, malicious images produced by attacks, and benign images from two categories. Next, we systematize both text-based and image-based detection methods. Finally, we evaluate their performance across multiple scenarios. Our key finding is that while some detectors can identify attacks relying on explicit textual instructions or visible image perturbations with moderate to high accuracy, they largely fail against attacks that omit explicit instructions or employ imperceptible perturbations. Our datasets and code are released at: https://github.com/Norrrrrrr-lyn/WAInjectBench.