SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing

📅 2025-09-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) face two key bottlenecks in GUI perception: (1) reliance on text-based autoregressive modeling of discrete coordinates, resulting in low localization accuracy and slow inference; and (2) support only for predefined UI element categories, limiting full-interface, fine-grained parsing. This work proposes an end-to-end continuous coordinate modeling paradigm that eliminates discrete tokenization. We design a robust Hungarian matching algorithm augmented with a rejection mechanism to improve generalization and unseen-element recognition. Additionally, we introduce a lightweight architecture comprising a token router and a coordinate decoder. Evaluated on four benchmarks—ScreenSpot, ScreenSpot-v2, CAGUI-Grounding, and ScreenParse—our method achieves significant improvements over state-of-the-art approaches: up to 32.7% reduction in false detection rate, 19.4% gain in parsing completeness, and 2.1× faster inference speed.

📝 Abstract
Existing Multimodal Large Language Models (MLLMs) for GUI perception have made great progress, but prior methods still face two challenges: 1) They model discrete coordinates through a text-based autoregressive mechanism, which lowers grounding accuracy and slows inference. 2) They can locate only predefined sets of elements and cannot parse the entire interface, which limits broad application and support for downstream tasks. To address these issues, we propose SparkUI-Parser, a novel end-to-end framework that achieves both higher localization precision and fine-grained parsing of the entire interface. Specifically, instead of probability-based discrete modeling, we model coordinates continuously on top of a pre-trained MLLM with an additional token router and coordinate decoder. This mitigates the limitations inherent in the discrete outputs and token-by-token generation of MLLMs, improving both accuracy and inference speed. To further enhance robustness, we introduce a rejection mechanism based on a modified Hungarian matching algorithm, which enables the model to identify and reject non-existent elements and thereby reduces false positives. Moreover, we present ScreenParse, a rigorously constructed benchmark that systematically assesses the structural perception capabilities of GUI models across diverse scenarios. Extensive experiments demonstrate that our approach consistently outperforms SOTA methods on the ScreenSpot, ScreenSpot-v2, CAGUI-Grounding, and ScreenParse benchmarks. The resources are available at https://github.com/antgroup/SparkUI-Parser.
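To make the idea of continuous coordinate modeling concrete, here is a minimal Python sketch of a token router feeding a coordinate decoder. It is an illustration under stated assumptions, not the authors' implementation: the `<coord>` placeholder token, the function names `route_tokens` and `decode_box`, the single linear layer with a sigmoid, and the random hidden states are all hypothetical. The point it shows is that each box comes out as four real numbers in one step, rather than as a string of digit tokens generated one at a time.

```python
import math
import random

COORD_TOKEN = "<coord>"  # hypothetical placeholder token, not taken from the paper


def route_tokens(tokens, hidden_states):
    """Split positions into ordinary text tokens and coordinate placeholders."""
    text, coord = [], []
    for tok, h in zip(tokens, hidden_states):
        (coord if tok == COORD_TOKEN else text).append((tok, h))
    return text, coord


def decode_box(hidden, weights, bias):
    """Assumed decoder: linear layer + sigmoid, giving normalized (x1, y1, x2, y2)."""
    box = []
    for w_row, b in zip(weights, bias):
        z = sum(w * h for w, h in zip(w_row, hidden)) + b
        box.append(1.0 / (1.0 + math.exp(-z)))
    return tuple(box)


# Toy inputs: random vectors stand in for the MLLM's hidden states.
random.seed(0)
dim = 8
tokens = ["button", "Submit", COORD_TOKEN, "icon", COORD_TOKEN]
hidden_states = [[random.uniform(-1, 1) for _ in range(dim)] for _ in tokens]
weights = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]
bias = [0.0] * 4

text_part, coord_part = route_tokens(tokens, hidden_states)
boxes = [decode_box(h, weights, bias) for _, h in coord_part]
```

In this sketch the text tokens would continue through the usual language-model head, while every routed placeholder yields one continuous box in normalized image coordinates.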
Problem

Research questions and friction points this paper is trying to address.

Improving GUI element localization accuracy and speed
Enabling full interface parsing beyond predefined elements
Reducing false positives with robust rejection mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous coordinate modeling with token router
Rejection mechanism using Hungarian matching
End-to-end framework for fine-grained parsing
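The rejection idea can be sketched as an assignment problem with extra "reject" slots. The Python below is a minimal illustration under assumptions, not the paper's modified algorithm: the L1 box cost, the `reject_cost` threshold, and the function name `match_with_rejection` are hypothetical, and the optimal assignment is found by brute force over permutations instead of the Hungarian algorithm (both return a minimum-cost matching, but brute force is only practical for tiny inputs).

```python
from itertools import permutations


def l1_cost(a, b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_with_rejection(preds, targets, reject_cost=0.5):
    """Return, for each prediction, the matched target index or None if rejected."""
    n, m = len(preds), len(targets)
    # Columns 0..m-1 are real targets; columns m..m+n-1 are reject slots
    # with a fixed cost, so a poor match is cheaper to reject than to keep.
    cost = [[l1_cost(p, t) for t in targets] + [reject_cost] * n for p in preds]
    best_total, best_assign = float("inf"), ()
    for perm in permutations(range(m + n), n):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_total, best_assign = total, perm
    return [j if j < m else None for j in best_assign]


preds = [
    (0.10, 0.10, 0.30, 0.20),  # close to targets[1]
    (0.50, 0.50, 0.70, 0.60),  # close to targets[0]
    (0.90, 0.05, 0.95, 0.10),  # no nearby target
]
targets = [
    (0.52, 0.50, 0.70, 0.62),
    (0.10, 0.12, 0.30, 0.20),
]
matches = match_with_rejection(preds, targets)
print(matches)  # [1, 0, None]
```

The third prediction has no target within the assumed cost threshold, so it is assigned to a reject slot instead of being forced onto a real element, which is the behavior that reduces false positives.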
Hongyi Jing
Ant Group
Jiafu Chen
Zhejiang University
Chen Rao
Zhejiang University
Ziqiang Dang
Ant Group
Computer Vision, Computer Graphics, MLLM
Jiajie Teng
Ant Group
Tianyi Chu
Zhejiang University
Juncheng Mo
Zhejiang University
Shuo Fang
Ant Group
Huaizhong Lin
Zhejiang University
Rui Lv
Ant Group
Chenguang Ma
Ant Group
Lei Zhao
Zhejiang University