WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

career value

172K/year

🤖 AI Summary

This work addresses the vulnerability of web-based intelligent agents to multimodal prompt injection attacks, which can manipulate their behavior and lead to information leakage. To mitigate this threat without compromising agent performance, the authors propose a dedicated guard model framework that operates in parallel with the agent and decouples attack detection from reasoning. The approach innovatively integrates a reasoning-driven mechanism that leverages multimodal inputs for detection. The guard model is trained using reasoning-intensive supervised fine-tuning and reinforcement learning on a synthetically generated multimodal dataset based on GPT-5. Experimental results demonstrate that the proposed method significantly outperforms strong baselines across multiple benchmarks, effectively safeguarding agent security and task utility with zero additional latency.

Technology Category

Application Category

📝 Abstract

Web agents powered by vision-language models (VLMs) enable autonomous interaction with web environments by perceiving and acting on both visual and textual webpage content to accomplish user-specified tasks. However, they are highly vulnerable to prompt injection attacks, where adversarial instructions embedded in HTML or rendered screenshots can manipulate agent behavior and lead to harmful outcomes such as information leakage. Existing defenses, including system prompt defenses and direct fine-tuning of agents, have shown limited effectiveness. To address this issue, we propose a defense framework in which a web agent operates in parallel with a dedicated guard agent, decoupling prompt injection detection from the agent's own reasoning. Building on this framework, we introduce WebAgentGuard, a reasoning-driven, multimodal guard model for prompt injection detection. We construct a synthetic multimodal dataset using GPT-5 spanning 164 topics and 230 visual and UI design styles, and train the model via reasoning-intensive supervised fine-tuning followed by reinforcement learning. Experiments across multiple benchmarks show that WebAgentGuard consistently outperforms strong baselines while preserving agent utility, without introducing additional latency.

Problem

Research questions and friction points this paper is trying to address.

prompt injection attacks

web agents

vision-language models

adversarial instructions

information leakage

Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning-driven guard

prompt injection detection

multimodal web agents