ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots

📅 2022-09-16
🏛️ arXiv.org
📈 Citations: 19
Influential: 3
🤖 AI Summary
Existing screen understanding benchmarks predominantly focus on low-level UI parsing or high-level navigation tasks, lacking systematic evaluation of "screen reading comprehension." ScreenQA addresses this gap with the first large-scale benchmark designed specifically for screen reading comprehension: a mobile application screenshot dataset comprising 86K question-answer pairs that covers multimodal semantics, including text, icons, and layout. Its key contributions are: (1) the first formal definition of screen reading comprehension as a distinct subtask; (2) fine-grained UI content annotations with bounding boxes, enabling four downstream tasks: question answering, object localization, logical reasoning, and cross-platform transfer; and (3) empirical validation of positive transfer from mobile to web interfaces. Experiments across zero-shot, fine-tuning, and transfer learning paradigms demonstrate significant improvements in multimodal model performance and interpretability on screen QA, while maintaining robust cross-platform generalization.
📝 Abstract
We introduce ScreenQA, a novel benchmarking dataset designed to advance screen content understanding through question answering. Existing screen datasets focus either on low-level structural and component understanding, or on much higher-level composite tasks such as navigation and task completion for autonomous agents. ScreenQA attempts to bridge this gap. By annotating 86K question-answer pairs over the RICO dataset, we aim to benchmark the screen reading comprehension capacity, thereby laying the foundation for vision-based automation over screenshots. Our annotations encompass full answers, short answer phrases, and corresponding UI contents with bounding boxes, enabling four subtasks to address various application scenarios. We evaluate the dataset's efficacy using both open-weight and proprietary models in zero-shot, fine-tuned, and transfer learning settings. We further demonstrate positive transfer to web applications, highlighting its potential beyond mobile applications.
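The annotation scheme described above (full answers, short answer phrases, and UI contents with bounding boxes) might be represented by a record along the following lines. This is a minimal illustrative sketch only; the field names and coordinate convention are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical shape of one ScreenQA annotation record.
# Field names and the (left, top, right, bottom) pixel convention
# are illustrative assumptions, not the released format.

@dataclass
class UIContent:
    text: str                        # text of the UI element grounding the answer
    bbox: Tuple[int, int, int, int]  # bounding box: (left, top, right, bottom)

@dataclass
class ScreenQAExample:
    screenshot_id: str               # RICO screenshot the question refers to
    question: str
    full_answer: str                 # complete-sentence answer
    short_answer: str                # short answer phrase
    ui_contents: List[UIContent] = field(default_factory=list)

# A made-up example record for illustration:
example = ScreenQAExample(
    screenshot_id="rico_00042",
    question="What is the battery level?",
    full_answer="The battery level is 80%.",
    short_answer="80%",
    ui_contents=[UIContent(text="80%", bbox=(900, 40, 960, 80))],
)
```

Splitting the answer into a full sentence, a short phrase, and grounded UI elements is what lets one annotation serve the several subtasks the paper describes (answering, localization, and transfer evaluation).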
Problem

Research questions and friction points this paper is trying to address.

Advance screen content understanding
Benchmark screen reading comprehension
Enable vision-based automation over screenshots
Innovation

Methods, ideas, or system contributions that make the work stand out.

Annotates 86k QA pairs
Enables four subtasks
Evaluates in diverse settings