🤖 AI Summary
This work addresses the lack of systematic evaluation for operating system (OS) agents, noting that existing benchmarks cover only narrow safety scenarios, rely on noisy trajectory labels, and offer limited robustness metrics. To bridge this gap, the authors propose OS-SPEAR, a four-dimensional evaluation toolkit tailored for OS agents, encompassing safety (environment- and human-induced hazards), performance (a subset curated via trajectory value estimation and stratified sampling), efficiency (measured by latency and token consumption), and robustness (evaluated under cross-modal perturbations of visual and textual inputs). The toolkit includes curated evaluation subsets and an automated tool that generates human-readable diagnostic reports. An evaluation of 22 popular OS agents reveals a prevalent trade-off between efficiency and safety or robustness, shows that specialized agents outperform general-purpose models, and highlights robustness vulnerabilities that vary across modalities. The study provides a standardized evaluation framework and multidimensional rankings to guide future research.
📝 Abstract
The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multimodal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) an S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an E(fficiency)-subset quantifying performance through the dual lenses of temporal latency and token consumption; and (4) an R(obustness)-subset that applies cross-modal disturbances to both visual and textual inputs. Additionally, we provide an automated analysis tool to generate human-readable diagnostic reports. We conduct an extensive evaluation of 22 popular OS agents using OS-SPEAR. Our empirical results reveal critical insights into the current landscape: notably, a prevalent trade-off between efficiency and safety or robustness, the performance superiority of specialized agents over general-purpose models, and varying robustness vulnerabilities across different modalities. By providing a multidimensional ranking and a standardized evaluation framework, OS-SPEAR offers a foundational resource for developing the next generation of reliable and efficient OS agents. The dataset and code are available at https://github.com/Wuzheng02/OS-SPEAR.
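The abstract does not spell out how the P-subset is drawn, so the following is only a minimal sketch of what stratified sampling over estimated trajectory values could look like. It is not the authors' implementation. The function name `stratified_sample`, the equal-width binning over [0, 1], the per-stratum quota, and the trajectory IDs and value estimates are all hypothetical and chosen purely for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(items, values, n_strata=4, per_stratum=2, seed=0):
    """Bin items into equal-width strata by value and sample from each stratum.

    Assumes every value lies in [0, 1]; this range is an illustrative
    assumption, not something stated in the abstract.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    strata = defaultdict(list)
    for item, value in zip(items, values):
        # Clamp the index so that value == 1.0 lands in the last stratum.
        index = min(int(value * n_strata), n_strata - 1)
        strata[index].append(item)

    sample = []
    for index in sorted(strata):
        bucket = strata[index]
        # Take at most per_stratum items; small strata contribute what they have.
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample


# Hypothetical trajectory IDs with made-up value estimates spread over [0, 1].
trajectory_ids = [f"traj_{i}" for i in range(12)]
value_estimates = [i / 11 for i in range(12)]

subset = stratified_sample(trajectory_ids, value_estimates)
print(len(subset))
```

Sampling a fixed quota from each value band keeps low-, mid-, and high-value trajectories all represented, whereas uniform sampling could leave a sparsely populated band out of the subset entirely.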