🤖 AI Summary
Multimodal large language models (MLLMs) deployed as GUI agents exhibit significant environmental fragility in real-world interfaces—non-malicious, irrelevant UI elements (e.g., ads, decorative icons) substantially degrade action fidelity.
Method: We construct a simulated GUI environment and introduce the first non-adversarial environmental-injection evaluation paradigm, systematically perturbing agent inputs under three perception settings: visual, vision-language, and instruction-based. We benchmark both general-purpose and GUI-specialized MLLMs under this paradigm.
Results: All evaluated models, including the state-of-the-art GUI-specialized agent, consistently exhibit distraction, with statistically significant increases in erroneous actions. This work provides the first systematic empirical demonstration of environmental-robustness deficiencies in MLLM-based GUI agents, exposing critical safety risks for real-world deployment. It further establishes a foundational evaluation framework and evidence base to guide the development of trustworthy multimodal agents.
📝 Abstract
This paper investigates the faithfulness of multimodal large language model (MLLM) agents in the graphical user interface (GUI) environment, asking whether multimodal GUI agents can be distracted by environmental context. We propose a general setting in which both the user and the agent are benign while the environment, though not malicious, contains unrelated content. A wide range of MLLMs are evaluated as GUI agents on our simulated dataset, following three working patterns with different levels of perception. Experimental results reveal that even the most powerful models, whether generalist or GUI-specialist agents, are susceptible to distraction. While recent studies predominantly focus on the helpfulness (i.e., action accuracy) of multimodal agents, our findings indicate that these agents are prone to environmental distractions that result in unfaithful behaviors. Finally, we adopt an adversarial perspective and implement environment injection, demonstrating that such unfaithfulness can be exploited, leading to unexpected risks.
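The evaluation idea behind the abstract can be illustrated with a toy protocol (a minimal sketch; the agent, element names, and metric here are hypothetical illustrations, not the paper's actual setup): inject a benign but unrelated element into an otherwise clean screen and measure how often the agent's chosen action changes.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    label: str
    relevant: bool  # True if the element serves the user's goal

def keyword_agent(instruction: str, screen: list) -> UIElement:
    """Toy stand-in for an MLLM agent: clicks the first element
    sharing a word with the instruction (hypothetical policy)."""
    words = set(instruction.lower().split())
    for el in screen:
        if words & set(el.label.lower().split()):
            return el
    return screen[0]  # fall back to the first element on screen

def distraction_rate(instruction, clean_screen, distractors, agent):
    """Fraction of injected screens on which the agent's action flips
    away from its action on the clean screen (illustrative metric)."""
    baseline = agent(instruction, clean_screen)
    flipped = 0
    for d in distractors:
        injected = [d] + clean_screen  # prepend a benign, unrelated element
        if agent(instruction, injected) is not baseline:
            flipped += 1
    return flipped / len(distractors)

clean = [UIElement("Submit order", True), UIElement("Help", False)]
ads = [UIElement("Claim free order bonus", False), UIElement("Settings", False)]
rate = distraction_rate("submit the order", clean, ads, keyword_agent)
print(rate)  # the ad sharing the word "order" flips the agent's action
```

Even this trivial keyword agent is derailed by an overlapping-but-irrelevant label, mirroring on a small scale the failure mode the paper measures for real MLLM agents.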