Dissecting Adversarial Robustness of Multimodal LM Agents

📅 2024-06-18
📈 Citations: 8
Influential: 1
🤖 AI Summary
Evaluating the adversarial robustness of multimodal LMs deployed as multi-component agents in realistic web environments remains challenging. Method: We propose ARE (Agent Robustness Evaluation), a benchmark framework built on top of VisualWebArena and comprising 200 targeted adversarial tasks for vision-language web agents. ARE models an agent as a graph of intermediate outputs flowing between components and decomposes robustness as the flow of adversarial information on that graph, enabling modular attribution of vulnerabilities under imperceptible image perturbations (under 5% of total web page pixels). Contribution/Results: Our experiments reveal that inference-time compute additions, particularly reflection evaluators and tree-search value functions, are critical failure points. Against state-of-the-art agents built on black-box frontier LMs, targeted hijacking succeeds at rates up to 67%; compromising the reflexion agent's evaluator and the tree-search agent's value function raises attack success by 15% and 20% relative, showing that architectural modularity does not inherently confer robustness and exposing fundamental fragilities in current agent designs.
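The graph view described above can be illustrated with a minimal sketch. The component names, transfer rates, and the simple sum-over-paths attribution rule below are illustrative assumptions, not the paper's implementation: the agent is a directed graph of components, and the influence of a perturbed input on the final action is decomposed over the paths adversarial information can take.

```python
# Illustrative ARE-style decomposition: model the agent as a directed graph of
# components and trace how adversarial signal injected at one component
# propagates to another. The transfer rates and attribution rule are
# hypothetical, chosen only to show the flow-graph abstraction.
from collections import defaultdict

class AgentGraph:
    def __init__(self):
        self.edges = defaultdict(list)   # component -> downstream components
        self.transfer = {}               # (src, dst) -> fraction of adversarial signal passed on

    def add_edge(self, src, dst, transfer=1.0):
        self.edges[src].append(dst)
        self.transfer[(src, dst)] = transfer

    def adversarial_influence(self, source, target):
        """Sum over all source->target paths of the product of edge transfer rates."""
        if source == target:
            return 1.0
        return sum(
            self.transfer[(source, nxt)] * self.adversarial_influence(nxt, target)
            for nxt in self.edges[source]
        )

g = AgentGraph()
g.add_edge("screenshot", "captioner", transfer=0.9)   # perturbed image enters via the captioner
g.add_edge("captioner", "policy_lm", transfer=0.8)
g.add_edge("screenshot", "policy_lm", transfer=0.3)   # raw pixels also reach the policy LM
g.add_edge("policy_lm", "action", transfer=1.0)

# Total influence of the perturbed screenshot on the emitted action:
# one path through the captioner (0.9 * 0.8) plus one direct path (0.3).
print(round(g.adversarial_influence("screenshot", "action"), 4))  # 1.02
```

Under this abstraction, adding a component (e.g. a reflection evaluator) adds nodes and edges, and hence new paths along which adversarial information can reach the action, which is consistent with the paper's finding that inference-time compute can harm robustness.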

📝 Abstract
As language models (LMs) are used to build autonomous agents in real environments, ensuring their adversarial robustness becomes a critical challenge. Unlike chatbots, agents are compound systems with multiple components taking actions, which existing LM safety evaluations do not adequately address. To bridge this gap, we manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena, a real environment for web agents. To systematically examine the robustness of agents, we propose the Agent Robustness Evaluation (ARE) framework. ARE views the agent as a graph showing the flow of intermediate outputs between components and decomposes robustness as the flow of adversarial information on the graph. We find that we can successfully break the latest agents that use black-box frontier LMs, including those that perform reflection and tree search. With imperceptible perturbations to a single image (less than 5% of total web page pixels), an attacker can hijack these agents to execute targeted adversarial goals with success rates up to 67%. We also use ARE to rigorously evaluate how robustness changes as new components are added. We find that inference-time compute that typically improves benign performance can open up new vulnerabilities and harm robustness. An attacker can compromise the evaluator used by the reflexion agent and the value function of the tree search agent, which increases attack success by 15% and 20% relative. Our data and code for attacks, defenses, and evaluation are at https://github.com/ChenWu98/agent-attack
Problem

Research questions and friction points this paper is trying to address.

How robust are multimodal LM agents to adversarial attacks in realistic web environments?
Existing LM safety evaluations target chatbots, not compound multi-component agent systems.
Do added components such as reflection and tree search improve or harm robustness under imperceptible perturbations?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manual creation of 200 targeted adversarial tasks with evaluation scripts
Agent Robustness Evaluation (ARE) framework
Graph-based analysis of adversarial information flow between components