🤖 AI Summary
This work identifies a novel security threat—Malicious Image Patches (MIPs)—against multimodal operating system (OS) agents in screenshot-aware scenarios: imperceptible adversarial perturbations embedded in GUI screenshots mislead agents into invoking critical APIs, triggering unauthorized actions such as web navigation or command execution. To this end, the authors propose the first cross-task, cross-layout, and cross-agent universal attack framework targeting visual inputs, integrating adversarial example generation, GUI screenshot modeling, and API behavioral reverse engineering to construct an end-to-end MIP injection and triggering pipeline. Experiments demonstrate that the attack achieves an average success rate exceeding 92% across mainstream multimodal OS agents, exhibiting strong generalizability and high stealthiness. These results systematically expose a fundamental robustness gap in current multimodal OS agents: the absence of rigorous validation mechanisms for image-based inputs.
📝 Abstract
Recent advances in operating system (OS) agents enable vision-language models to interact directly with the graphical user interface of an OS. These multimodal OS agents autonomously perform computer-based tasks in response to a single prompt via application programming interfaces (APIs). Such APIs typically support low-level operations, including mouse clicks, keyboard inputs, and screenshot captures. We introduce a novel attack vector: malicious image patches (MIPs) that have been adversarially perturbed so that, when captured in a screenshot, they cause an OS agent to perform harmful actions by exploiting specific APIs. For instance, MIPs embedded in desktop backgrounds or shared on social media can redirect an agent to a malicious website, enabling further exploitation. These MIPs generalise across different user requests and screen layouts, and remain effective for multiple OS agents. The existence of such attacks highlights critical security vulnerabilities in OS agents, which should be carefully addressed before their widespread adoption.