WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current GUI agents are not robust in real-world multi-window desktop environments, where occlusion, dense layouts, and visual clutter are common. This work proposes WinDeskGround, a benchmark and synthesis framework designed to evaluate the robustness of GUI grounding under these conditions. The framework programmatically generates diverse desktop scenarios with controllable window occlusion, layout density, and semantic similarity, simulating the distribution shifts encountered in authentic user workflows. Using a curated meta-dataset of 1,356 high-fidelity instruction-target pairs, the authors evaluate five leading multimodal large language models and find that, while top-tier models perform well on simplified interfaces, their grounding accuracy declines under partial occlusion. These findings motivate the benchmark as a tool for assessing GUI agent robustness in realistic environments.
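The comparison between simplified and partially occluded settings can be illustrated with a small scoring sketch. It assumes a common grounding convention in which a prediction counts as correct when the predicted click point falls inside the target's bounding box; the summary does not state the paper's exact metric, and the field names (`condition`, `pred`, `target_box`) and sample values are hypothetical.

```python
from collections import defaultdict


def point_in_box(point, box):
    """Return True if (x, y) lies inside box = (left, top, right, bottom)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def accuracy_by_condition(samples):
    """Group samples by their (hypothetical) condition label and
    compute the fraction of predicted points that hit the target box."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample in samples:
        totals[sample["condition"]] += 1
        if point_in_box(sample["pred"], sample["target_box"]):
            hits[sample["condition"]] += 1
    return {cond: hits[cond] / totals[cond] for cond in totals}


# Made-up illustrative predictions, not results from the paper.
samples = [
    {"condition": "unoccluded", "pred": (50, 50), "target_box": (40, 40, 60, 60)},
    {"condition": "unoccluded", "pred": (10, 10), "target_box": (0, 0, 20, 20)},
    {"condition": "partial", "pred": (50, 50), "target_box": (40, 40, 60, 60)},
    {"condition": "partial", "pred": (90, 90), "target_box": (40, 40, 60, 60)},
]
print(accuracy_by_condition(samples))
```

Reporting accuracy per condition rather than as a single aggregate is what makes a drop under occlusion visible at all.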

📝 Abstract

Multimodal Large Language Models (MLLMs) have revolutionized GUI automation, yet their efficacy is largely established on idealized, single-layer interfaces. This paper identifies a critical reliability gap: state-of-the-art agents face distinct robustness challenges in real-world desktop environments characterized by multi-window stacking, occlusion, and visual clutter. To address this, we introduce WinDeskGround, a novel benchmark and synthesis framework tailored for evaluating GUI grounding robustness. Unlike static datasets, our framework parametrically generates complex desktop scenarios by controlling window occlusion, layout density, and semantic similarity, thereby simulating the distribution shifts of authentic workflows. We construct a diverse meta-dataset of 1,356 high-fidelity instruction-target pairs and conduct comprehensive evaluations of five leading MLLMs. Our results demonstrate that while top-tier agents excel in simplified settings, their accuracy declines under partial occlusion. WinDeskGround provides a valuable benchmark to facilitate the assessment and advancement of GUI agent robustness in realistic environments. The code is available at https://github.com/ZZZhr-1/WinDeskGround.
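To make the idea of a controllable occlusion level concrete, the following sketch computes how much of a target element remains visible when other windows are stacked above it. It is an illustration under stated assumptions, not the WinDeskGround implementation: rectangles are axis-aligned `(left, top, right, bottom)` tuples, every occluder is assumed to sit above the target in z-order, and the function name `visible_fraction` is hypothetical.

```python
def visible_fraction(target, occluders):
    """Fraction of the target rectangle not covered by any occluder.

    All rectangles are (left, top, right, bottom). Overlapping occluders
    are handled by coordinate compression, so shared area is counted once.
    """
    t_left, t_top, t_right, t_bottom = target
    area = (t_right - t_left) * (t_bottom - t_top)
    if area <= 0:
        raise ValueError("target must have positive area")

    # Clip each occluder to the target and drop those that do not overlap it.
    clipped = []
    for left, top, right, bottom in occluders:
        left, top = max(left, t_left), max(top, t_top)
        right, bottom = min(right, t_right), min(bottom, t_bottom)
        if left < right and top < bottom:
            clipped.append((left, top, right, bottom))

    # Split the target into grid cells along every rectangle edge.
    xs = sorted({t_left, t_right, *(v for c in clipped for v in (c[0], c[2]))})
    ys = sorted({t_top, t_bottom, *(v for c in clipped for v in (c[1], c[3]))})

    covered = 0
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            # A cell is covered if any clipped occluder fully contains it.
            if any(l <= x0 and x1 <= r and t <= y0 and y1 <= b
                   for l, t, r, b in clipped):
                covered += (x1 - x0) * (y1 - y0)
    return 1 - covered / area


button = (0, 0, 100, 50)
windows_above = [(50, 0, 200, 50), (75, 0, 100, 25)]  # second overlaps the first
print(visible_fraction(button, windows_above))
```

A generator could place windows, compute this value for the target, and accept or reject the layout until the visible fraction falls in the desired range, which is one plausible way to parameterize occlusion.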

Problem

Research questions and friction points this paper is trying to address.

GUI grounding

robustness

multi-window environments

occlusion

desktop automation

Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI grounding

multimodal large language models

robustness benchmark

multi-window desktop

synthetic data generation

🔎 Similar Papers

Visual grounding for desktop graphical user interfaces

2024-05-05 · Citations: 1

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

2024-10-07 · arXiv.org · Citations: 17