Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

πŸ“… 2026-03-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the vulnerability of multimodal web agents to cross-modal adversarial attacks that simultaneously manipulate screenshots and accessibility treesβ€”a challenge that existing defenses struggle to mitigate, particularly when visual deception is involved. To this end, the authors propose DMAST, a novel framework that formalizes cross-modal attack and defense as a two-player zero-sum Markov game. DMAST employs a three-stage co-training paradigm: imitation learning from a strong teacher, oracle-guided supervised fine-tuning with a zero-acknowledgment strategy, and GRPO-based adversarial self-play, enabling the co-evolution of both agent and attacker. This approach substantially enhances out-of-distribution robustness and generalization, doubling task-completion efficiency and outperforming current training- and prompting-based defense strategies.

πŸ“ Abstract
Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects content into the webpage DOM simultaneously corrupts both observation channels with a consistent deceptive narrative. Our vulnerability analysis on MiniWob++ reveals that attacks including a visual component far outperform text-only injections, exposing critical gaps in text-centric VLM safety training. Motivated by this finding, we propose Dual-Modality Multi-Stage Adversarial Safety Training (DMAST), a framework that formalizes the agent-attacker interaction as a two-player zero-sum Markov game and co-trains both players through a three-stage pipeline: (1) imitation learning from a strong teacher model, (2) oracle-guided supervised fine-tuning that uses a novel zero-acknowledgment strategy to instill task-focused reasoning under adversarial noise, and (3) adversarial reinforcement learning via Group Relative Policy Optimization (GRPO) self-play. On out-of-distribution tasks, DMAST substantially mitigates adversarial risks while simultaneously doubling task completion efficiency. Our approach significantly outperforms established training-based and prompt-based defenses, demonstrating genuine co-evolutionary progress and robust generalization to complex, unseen environments.
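The abstract's final stage pairs a zero-sum reward structure with GRPO self-play. A minimal sketch of those two ingredients (the rollout rewards and group size here are illustrative placeholders, not values from the paper):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO core idea: each rollout's advantage is its reward standardized
    against the mean and std of its sampled group (no learned value critic)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against uniform-reward groups
    return [(r - mean) / std for r in rewards]

def zero_sum_rewards(agent_reward):
    """Two-player zero-sum structure: the attacker's reward is the agent's, negated."""
    return agent_reward, -agent_reward

# Toy self-play round: the agent samples a group of G rollouts in the
# adversarially perturbed environment; each player scores its own rewards.
agent_rewards = [1.0, 0.0, 1.0, 0.5]  # placeholder task-success signals, G = 4
attacker_rewards = [zero_sum_rewards(r)[1] for r in agent_rewards]

agent_adv = group_relative_advantages(agent_rewards)
attacker_adv = group_relative_advantages(attacker_rewards)
```

Because the game is zero-sum, the attacker's group-relative advantages are exactly the negation of the agent's, which is what lets the two policies co-evolve against a single shared reward signal.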
Problem

Research questions and friction points this paper is trying to address.

multimodal web agents
cross-modal attacks
adversarial safety
dual-stream architecture
DOM injection
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal web agents
cross-modal attacks
adversarial safety training
zero-sum Markov game
Group Relative Policy Optimization