Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

📅 2026-02-15
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing intelligent agents rely on sparse and insufficiently diverse interface annotations, limiting their ability to comprehensively understand screen layouts and accurately execute instructions. To address this, this work proposes the first complete screen parsing supervision paradigm, introducing ScreenParse, a dataset of 771,000 web screenshots with dense annotations for all visible UI elements, generated via an automated pipeline called Webshot. The authors design a structured markup representation, ScreenTag, along with a structure-aware loss function. A lightweight vision-language model, ScreenVLM, trained under this paradigm achieves a PageIoU of 0.592 on ScreenParse, substantially outperforming larger foundation models, and shows strong transfer to public benchmarks. Furthermore, fine-tuning foundation VLMs on ScreenParse consistently improves their grounding performance.

Technology Category

Application Category

๐Ÿ“ Abstract
Modern computer-use agents (CUAs) must perceive a screen as a structured state (what elements are visible, where they are, and what text they contain) before they can reliably ground instructions and act. Yet most available grounding datasets provide sparse supervision: insufficient, low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization. Moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision-language model (VLM) that decodes a concise ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, fine-tuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. Project page: https://saidgurbuz.github.io/screenparse/.
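The abstract's "structure-aware loss that upweights structure-critical tokens" can be pictured as a per-token weighted cross-entropy. The sketch below is an assumption, not the paper's implementation: the weight value, the function name, and the notion of a boolean `structure_mask` (marking, e.g., ScreenTag delimiter and coordinate tokens) are all hypothetical.

```python
import torch
import torch.nn.functional as F

def structure_aware_loss(logits, targets, structure_mask, structure_weight=2.0):
    """Weighted cross-entropy that upweights structure-critical tokens.

    logits:         (T, V) per-token vocabulary scores
    targets:        (T,)   gold token ids
    structure_mask: (T,)   bool, True where the token is structure-critical
                           (hypothetical; the paper does not specify the scheme)
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")  # (T,)
    weights = torch.where(
        structure_mask,
        torch.full_like(per_token, structure_weight),  # boosted weight
        torch.ones_like(per_token),                    # default weight
    )
    # Normalize by total weight so the loss scale is comparable to plain CE.
    return (weights * per_token).sum() / weights.sum()
```

With an all-False mask this reduces to ordinary mean cross-entropy; raising `structure_weight` shifts gradient mass toward the tokens that define the markup's structure rather than free text.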
Problem

Research questions and friction points this paper is trying to address.

sparse supervision
UI understanding
screen parsing
grounding
dense annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

complete screen parsing
dense UI annotation
structure-aware VLM
ScreenTag markup
transferable structural priors
A. Said Gurbuz
IBM Research Zurich, Zurich, Switzerland
Sunghwan Hong
Postdoc @ ETHZ
Computer Vision, 3D Vision, SfM, SLAM
Ahmed Nassar
IBM Research Zurich, Zurich, Switzerland
Marc Pollefeys
Professor of Computer Science, ETH Zurich, and Director Spatial AI Lab, Microsoft
Computer Vision, Computer Graphics, Robotics, Machine Learning, Augmented Reality
Peter Staar
IBM Research Zurich, Zurich, Switzerland