WebGuard: Building a Generalizable Guardrail for Web Agents

📅 2025-07-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
LLM-driven autonomous web agents pose safety risks in real-world environments by potentially executing harmful actions on websites. Method: This paper introduces the first general-purpose safety framework tailored to realistic web environments, centered on predicting webpage state changes to assess the risk of an agent's actions. It constructs a novel, manually annotated dataset of 4,939 operations spanning 22 domains, defines a three-tier risk taxonomy (SAFE, LOW, HIGH), and explicitly covers long-tail websites and cross-domain generalization. The framework fine-tunes the Qwen2.5-VL-7B multimodal foundation model for risk classification. Contribution/Results: Risk-identification accuracy improves from 37% to 80%, and HIGH-risk action recall rises from 20% to 76%. The framework substantially strengthens safety assurance for web agents and addresses gaps in the safety evaluation and generalization validation of autonomous web agents.

📝 Abstract
The rapid development of autonomous web agents powered by Large Language Models (LLMs), while greatly elevating efficiency, exposes the frontier risk of taking unintended or harmful actions. This situation underscores an urgent need for effective safety measures, akin to access controls for human users. To address this critical challenge, we introduce WebGuard, the first comprehensive dataset designed to support the assessment of web agent action risks and facilitate the development of guardrails for real-world online environments. WebGuard focuses on predicting the outcome of state-changing actions and contains 4,939 human-annotated actions from 193 websites across 22 diverse domains, including often-overlooked long-tail websites. These actions are categorized using a novel three-tier risk schema: SAFE, LOW, and HIGH. The dataset includes designated training and test splits to support evaluation under diverse generalization settings. Our initial evaluations reveal a concerning deficiency: even frontier LLMs achieve less than 60% accuracy in predicting action outcomes and less than 60% recall in flagging HIGH-risk actions, highlighting the risks of deploying current-generation agents without dedicated safeguards. We therefore investigate fine-tuning specialized guardrail models using WebGuard. We conduct comprehensive evaluations across multiple generalization settings and find that a fine-tuned Qwen2.5-VL-7B model yields a substantial improvement in performance, boosting accuracy from 37% to 80% and HIGH-risk action recall from 20% to 76%. Despite these improvements, performance still falls short of the reliability required for high-stakes deployment, where guardrails must approach near-perfect accuracy and recall.
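The two headline metrics in the abstract, overall accuracy and HIGH-risk recall, can be computed from per-action gold and predicted risk labels as in this minimal sketch. The function name and the toy label lists are illustrative assumptions, not artifacts from the paper.

```python
def guardrail_metrics(gold, pred):
    """Compute overall accuracy and HIGH-risk recall over the
    three-tier labels (SAFE, LOW, HIGH) used by WebGuard."""
    assert len(gold) == len(pred), "label lists must align"
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    # Recall on the HIGH tier: of all truly HIGH-risk actions,
    # how many did the guardrail also predict as HIGH?
    high_total = sum(g == "HIGH" for g in gold)
    high_hits = sum(g == "HIGH" and p == "HIGH" for g, p in zip(gold, pred))
    high_recall = high_hits / high_total if high_total else 0.0
    return accuracy, high_recall

# Toy example (hypothetical labels, not WebGuard data):
gold = ["SAFE", "LOW", "HIGH", "HIGH", "SAFE"]
pred = ["SAFE", "HIGH", "HIGH", "LOW", "SAFE"]
acc, rec = guardrail_metrics(gold, pred)  # acc = 0.6, rec = 0.5
```

High recall on the HIGH tier matters more than raw accuracy here: a missed HIGH-risk action is an irreversible mistake, while a false alarm only costs a confirmation step.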
Problem

Research questions and friction points this paper is trying to address.

Assessing risks of unintended actions by web agents
Developing guardrails for safe web agent operations
Improving accuracy in predicting high-risk action outcomes
Innovation

Methods, ideas, or system contributions that make the work stand out.

WebGuard dataset for web agent risk assessment
Three-tier risk schema: SAFE, LOW, HIGH
Fine-tuned Qwen2.5-VL-7B model boosts risk-identification accuracy
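A guardrail built on the three-tier schema might gate agent actions as sketched below. The policy mapping (execute / confirm / defer) and the function names are illustrative assumptions; the paper defines the risk tiers but not this particular enforcement policy.

```python
from enum import Enum

class Risk(Enum):
    """WebGuard's three-tier risk schema for state-changing actions."""
    SAFE = 0
    LOW = 1
    HIGH = 2

def gate_action(predicted_risk: Risk) -> str:
    """Map the guardrail model's predicted risk tier to a decision.

    SAFE -> execute automatically
    LOW  -> ask the user to confirm before executing
    HIGH -> block and defer to a human (e.g. a purchase or deletion)
    """
    if predicted_risk is Risk.HIGH:
        return "defer_to_human"
    if predicted_risk is Risk.LOW:
        return "ask_confirmation"
    return "execute"

# Hypothetical usage: the fine-tuned model classifies a click on a
# "Delete account" button as HIGH, so the agent must stop and escalate.
decision = gate_action(Risk.HIGH)  # -> "defer_to_human"
```

Keeping the gating policy separate from the classifier makes it easy to tighten the policy (e.g. also confirming SAFE actions on unfamiliar long-tail sites) without retraining the model.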