OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

📅 2025-06-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language model (LLM)-driven computer use agents—which interact with graphical user interfaces via screenshots or accessibility trees—lack systematic safety evaluation. Method: We introduce OS-Harm, a benchmark for measuring the safety of computer use agents, covering three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. It comprises 150 tasks spanning several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) that require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). We also propose an automated judge that evaluates both the accuracy and the safety of agents, achieving F1 scores of 0.76 and 0.79 against human annotations. Built on the OSWorld environment, OS-Harm is used to evaluate agents based on frontier models—including o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro—revealing that all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. Contribution/Results: OS-Harm is open-sourced, providing standardized infrastructure for safety evaluation of computer use agents.

📝 Abstract
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score). We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm.
Problem

Research questions and friction points this paper is trying to address.

Evaluating safety risks of LLM-based computer use agents
Measuring harm potential in user misuse and attacks
Assessing agent safety across diverse OS applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

OS-Harm benchmark for measuring the safety of computer use agents
Automated judge evaluating both accuracy and safety, with high agreement with human annotations
150 tasks spanning three harm categories and diverse OS applications
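The paper validates its automated judge by measuring F1 agreement with human annotations (0.76 and 0.79). As a rough illustration of that validation step, the sketch below computes a binary F1 score of judge labels against human labels; the label lists and function are hypothetical, not taken from the OS-Harm codebase.

```python
def f1_score(human, judge, positive=True):
    """Binary F1 of automated-judge labels against human labels."""
    tp = sum(h == positive and j == positive for h, j in zip(human, judge))
    fp = sum(h != positive and j == positive for h, j in zip(human, judge))
    fn = sum(h == positive and j != positive for h, j in zip(human, judge))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-task labels: True = "unsafe behavior observed"
human = [True, False, True, True, False, False, True, False]
judge = [True, False, True, False, False, True, True, False]
print(round(f1_score(human, judge), 2))  # → 0.75
```

In practice the paper reports two such scores, one for the accuracy judgment and one for the safety judgment, each computed against human-annotated trajectories.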