🤖 AI Summary
Existing GUI localization methods suffer from low accuracy under visually cluttered conditions and ambiguous natural-language instructions, while also exhibiting high module coupling and poor generalization. To address these limitations, this paper proposes a multi-stage enhancement framework featuring bidirectional ROI scaling for coarse-grained region localization and a context-aware rewriting agent for fine-grained UI element identification. The framework establishes a modular, interpretable vision-language collaborative reasoning pipeline; its core contribution is the combination of multi-scale visual analysis with semantic instruction rewriting. Evaluated on the ScreenSpot-Pro and OSWorld-G benchmarks, the method achieves state-of-the-art accuracy of 73.18% and 68.63%, respectively, substantially outperforming prior approaches. This work offers a precise mapping from natural-language instructions to screen coordinates, advancing robustness, modularity, and interpretability in vision-language GUI understanding.
📝 Abstract
Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to screen coordinates, is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy than monolithic approaches. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results. Code and the Grounding Benchmark Toolkit (GBT) are available at https://github.com/samsungsds-research-papers/mega-gui.
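The two-stage structure described in the abstract can be sketched as follows. This is a minimal illustrative outline, not the authors' implementation: the function names (`zoom_roi`, `ground`), the agent callables, and the zoom factor of 1.5 are all hypothetical placeholders standing in for MEGA-GUI's specialized vision-language agents.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

def zoom_roi(box: Box, screen: tuple, factor: float) -> Box:
    """Bidirectional ROI zoom (illustrative): factor > 1 expands the
    region to recover context lost to over-cropping; factor < 1 shrinks
    it to counter spatial dilution on dense screens. The result is
    clamped to the screen bounds."""
    cx, cy = (box.x0 + box.x1) / 2, (box.y0 + box.y1) / 2
    w, h = (box.x1 - box.x0) * factor, (box.y1 - box.y0) * factor
    sw, sh = screen
    return Box(max(0, cx - w / 2), max(0, cy - h / 2),
               min(sw, cx + w / 2), min(sh, cy + h / 2))

def ground(instruction, screen_size, roi_agent, rewrite_agent, ground_agent):
    """Two-stage grounding: coarse ROI selection, then fine-grained
    element grounding inside the (zoomed) ROI. The three agent
    callables stand in for the framework's vision-language agents."""
    # Stage 1: coarse ROI selection, followed by a bidirectional zoom.
    roi = roi_agent(instruction)
    roi = zoom_roi(roi, screen_size, factor=1.5)
    # Context-aware rewriting disambiguates the instruction for this ROI.
    clearer = rewrite_agent(instruction, roi)
    # Stage 2: fine grounding returns ROI-local coordinates; map to global.
    lx, ly = ground_agent(clearer, roi)
    return roi.x0 + lx, roi.y0 + ly
```

Keeping the stages behind separate callables mirrors the modularity claim in the abstract: each agent can be swapped or evaluated independently, rather than being fused into one monolithic model.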