🤖 AI Summary
Existing GUI localization methods suffer from low accuracy under visually cluttered conditions and ambiguous natural-language instructions, while also exhibiting high module coupling and poor generalization. To address these limitations, this paper proposes a multi-stage enhancement framework featuring bidirectional ROI scaling for coarse-grained region localization and a context-aware rewriting agent for fine-grained UI element identification. The framework establishes a modular, interpretable vision-language collaborative reasoning pipeline; its core contribution is the combination of multi-scale visual analysis with semantic instruction rewriting. Evaluated on the ScreenSpot-Pro and OSWorld-G benchmarks, the method achieves state-of-the-art accuracy of 73.18% and 68.63%, respectively, substantially outperforming prior approaches. This work offers a precise mapping from natural-language instructions to screen coordinates, advancing robustness, modularity, and interpretability in vision-language GUI understanding.
📝 Abstract
Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to screen coordinates, is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy than monolithic approaches. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results. Code and the Grounding Benchmark Toolkit (GBT) are available at https://github.com/samsungsds-research-papers/mega-gui.
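The two-stage structure described in the abstract can be sketched as follows. This is a minimal illustrative outline, not the authors' implementation: the function names (`zoom_roi`, `ground`), the agent callables, and the zoom factor of 1.5 are all hypothetical placeholders standing in for MEGA-GUI's specialized vision-language agents.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

def zoom_roi(box: Box, screen: tuple, factor: float) -> Box:
    """Bidirectional ROI zoom (illustrative): factor > 1 expands the
    region to recover context lost to over-cropping; factor < 1 shrinks
    it to counter spatial dilution on dense screens. The result is
    clamped to the screen bounds."""
    cx, cy = (box.x0 + box.x1) / 2, (box.y0 + box.y1) / 2
    w, h = (box.x1 - box.x0) * factor, (box.y1 - box.y0) * factor
    sw, sh = screen
    return Box(max(0, cx - w / 2), max(0, cy - h / 2),
               min(sw, cx + w / 2), min(sh, cy + h / 2))

def ground(instruction, screen_size, roi_agent, rewrite_agent, ground_agent):
    """Two-stage grounding: coarse ROI selection, then fine-grained
    element grounding inside the (zoomed) ROI. The three agent
    callables stand in for the framework's vision-language agents."""
    # Stage 1: coarse ROI selection, followed by a bidirectional zoom.
    roi = roi_agent(instruction)
    roi = zoom_roi(roi, screen_size, factor=1.5)
    # Context-aware rewriting disambiguates the instruction for this ROI.
    clearer = rewrite_agent(instruction, roi)
    # Stage 2: fine grounding returns ROI-local coordinates; map to global.
    lx, ly = ground_agent(clearer, roi)
    return roi.x0 + lx, roi.y0 + ly
```

Keeping the stages behind separate callables mirrors the modularity claim in the abstract: each agent can be swapped or evaluated independently, rather than being fused into one monolithic model.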