SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing

📅 2025-09-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current multimodal large language models (MLLMs) face two key bottlenecks in GUI perception: (1) they rely on text-based autoregressive modeling of discrete coordinates, which lowers localization accuracy and slows inference; and (2) they support only predefined UI element categories, which rules out fine-grained parsing of the full interface. This work proposes SparkUI-Parser, an end-to-end framework that models coordinates continuously instead of emitting them as discrete tokens. It adds a lightweight token router and coordinate decoder on top of a pre-trained MLLM, and it introduces a rejection mechanism based on a modified Hungarian matching algorithm so that the model can identify and reject elements that are not present, reducing false positives. Evaluated on four benchmarks (ScreenSpot, ScreenSpot-v2, CAGUI-Grounding, and ScreenParse), the method is reported to improve significantly over state-of-the-art approaches: up to 32.7% reduction in false detection rate, 19.4% gain in parsing completeness, and 2.1× faster inference speed.

📝 Abstract

Existing Multimodal Large Language Models (MLLMs) for GUI perception have made great progress. However, prior methods still face two challenges: 1) They model discrete coordinates with a text-based autoregressive mechanism, which lowers grounding accuracy and slows inference. 2) They can only locate predefined sets of elements and cannot parse the entire interface, which limits broad application and support for downstream tasks. To address these issues, we propose SparkUI-Parser, a novel end-to-end framework that simultaneously achieves higher localization precision and fine-grained parsing of the entire interface. Specifically, instead of probability-based discrete modeling, we model coordinates continuously on top of a pre-trained MLLM with an additional token router and coordinate decoder. This effectively mitigates the limitations inherent in the discrete outputs and token-by-token generation of MLLMs, boosting both accuracy and inference speed. To further enhance robustness, we introduce a rejection mechanism based on a modified Hungarian matching algorithm, which enables the model to identify and reject non-existent elements, thereby reducing false positives. Moreover, we present ScreenParse, a rigorously constructed benchmark that systematically assesses the structural perception capabilities of GUI models across diverse scenarios. Extensive experiments demonstrate that our approach consistently outperforms SOTA methods on the ScreenSpot, ScreenSpot-v2, CAGUI-Grounding, and ScreenParse benchmarks. The resources are available at https://github.com/antgroup/SparkUI-Parser.
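
To make the idea of continuous coordinate modeling more concrete, the following is a minimal sketch of how a token router and coordinate decoder could fit together. It is not the authors' implementation: the `<LOC>` placeholder token, the function names, the single linear layer, and the toy weights are all assumptions made for illustration. The point it shows is that one routed token yields a whole box of real-valued coordinates in a single step, rather than a string of digit tokens generated one at a time.

```python
import math

# Hypothetical placeholder token; the abstract does not name the token it uses.
LOC_TOKEN = "<LOC>"


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def coordinate_decoder(hidden, weights, bias):
    """Assumed decoder: one linear layer plus sigmoid.

    Maps a hidden vector to continuous coordinates in (0, 1),
    one output per row of `weights` (four rows give x1, y1, x2, y2).
    """
    return [
        sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(weights, bias)
    ]


def route_tokens(tokens, hiddens, weights, bias):
    """Assumed router: send <LOC> tokens to the decoder, keep the rest as text."""
    text, boxes = [], []
    for tok, hidden in zip(tokens, hiddens):
        if tok == LOC_TOKEN:
            boxes.append(coordinate_decoder(hidden, weights, bias))
        else:
            text.append(tok)
    return text, boxes


# Toy usage with a 2-dimensional hidden state and zeroed decoder weights.
tokens = ["button", LOC_TOKEN]
hiddens = [[0.0, 0.0], [0.0, 0.0]]
weights = [[0.0, 0.0] for _ in range(4)]
bias = [0.0, 0.0, 0.0, 0.0]
text, boxes = route_tokens(tokens, hiddens, weights, bias)
```

In a real model the hidden states would come from the MLLM's final layer and the decoder weights would be learned; the routing step is what lets text tokens and coordinates share one forward pass.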

Problem

Research questions and friction points this paper is trying to address.

Improving GUI element localization accuracy and speed

Enabling full interface parsing beyond predefined elements

Reducing false positives with robust rejection mechanisms

Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous coordinate modeling with token router

Rejection mechanism using Hungarian matching

End-to-end framework for fine-grained parsing
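
The rejection mechanism listed above can be illustrated with a small sketch. The abstract does not spell out how the Hungarian matching is modified, so this example assumes one common approach: pad the cost matrix with "no object" columns at a fixed cost, so that any prediction whose cheapest real match costs more than that threshold is rejected instead. For the sake of a dependency-free example, the optimal assignment is found by brute force over permutations rather than by the Hungarian algorithm itself; on tiny inputs both return an assignment with the same minimum total cost. The names `match_with_rejection` and `reject_cost`, and the L1 box cost, are hypothetical.

```python
from itertools import permutations


def l1_cost(box_a, box_b):
    """L1 distance between two boxes given as (x1, y1, x2, y2)."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_with_rejection(preds, targets, reject_cost=0.5):
    """Optimal one-to-one matching with a rejection option (assumed design).

    Each prediction is assigned either to a distinct target or to a
    dummy "no object" column that costs `reject_cost`. Returns a list of
    (pred_index, target_index_or_None) pairs.
    """
    n, m = len(preds), len(targets)
    # Real target columns first, then n dummy columns so every prediction
    # can be rejected if needed.
    cost = [
        [l1_cost(p, t) for t in targets] + [reject_cost] * n
        for p in preds
    ]
    best_total, best_cols = float("inf"), ()
    # Brute-force stand-in for the Hungarian algorithm; only for tiny inputs.
    for cols in permutations(range(m + n), n):
        total = sum(cost[i][c] for i, c in enumerate(cols))
        if total < best_total:
            best_total, best_cols = total, cols
    return [(i, c if c < m else None) for i, c in enumerate(best_cols)]


# Toy usage: three predicted boxes, two ground-truth boxes.
preds = [(0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.6, 0.6), (0.9, 0.9, 1.0, 1.0)]
targets = [(0.5, 0.5, 0.6, 0.62), (0.1, 0.12, 0.2, 0.2)]
matches = match_with_rejection(preds, targets)
```

Here the third prediction has no nearby ground-truth box, so it is assigned to a dummy column and reported with `None`, which is the behavior a rejection mechanism needs in order to suppress false positives.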

🔎 Similar Papers

Visual grounding for desktop graphical user interfaces