ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots

📅 2022-09-16
🏛️ arXiv.org
📈 Citations: 19
Influential: 3
🤖 AI Summary
Existing screen understanding benchmarks predominantly focus on low-level UI parsing or high-level navigation tasks, lacking systematic evaluation of "screen reading comprehension." ScreenQA addresses this gap with the first large-scale benchmark designed specifically for screen reading comprehension: a mobile application screenshot dataset comprising 86K question-answer pairs that covers multimodal semantics, including text, icons, and layout. Its key contributions are: (1) the first formal definition of screen reading comprehension as a distinct subtask; (2) fine-grained UI content annotations with bounding boxes, enabling four downstream tasks: question answering, object localization, logical reasoning, and cross-platform transfer; and (3) empirical validation of positive transfer from mobile to web interfaces. Experiments across zero-shot, fine-tuning, and transfer learning paradigms demonstrate significant improvements in multimodal model performance and interpretability on screen QA, while maintaining robust cross-platform generalization.
📝 Abstract
We introduce ScreenQA, a novel benchmarking dataset designed to advance screen content understanding through question answering. Existing screen datasets focus either on low-level structural and component understanding, or on much higher-level composite tasks such as navigation and task completion for autonomous agents. ScreenQA attempts to bridge this gap. By annotating 86K question-answer pairs over the RICO dataset, we aim to benchmark the screen reading comprehension capacity, thereby laying the foundation for vision-based automation over screenshots. Our annotations encompass full answers, short answer phrases, and corresponding UI contents with bounding boxes, enabling four subtasks to address various application scenarios. We evaluate the dataset's efficacy using both open-weight and proprietary models in zero-shot, fine-tuned, and transfer learning settings. We further demonstrate positive transfer to web applications, highlighting its potential beyond mobile applications.
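The annotation scheme described above (full answers, short answer phrases, and UI contents with bounding boxes) might be represented by a record along the following lines. This is a minimal illustrative sketch only; the field names and coordinate convention are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical shape of one ScreenQA annotation record.
# Field names and the (left, top, right, bottom) pixel convention
# are illustrative assumptions, not the released format.

@dataclass
class UIContent:
    text: str                        # text of the UI element grounding the answer
    bbox: Tuple[int, int, int, int]  # bounding box: (left, top, right, bottom)

@dataclass
class ScreenQAExample:
    screenshot_id: str               # RICO screenshot the question refers to
    question: str
    full_answer: str                 # complete-sentence answer
    short_answer: str                # short answer phrase
    ui_contents: List[UIContent] = field(default_factory=list)

# A made-up example record for illustration:
example = ScreenQAExample(
    screenshot_id="rico_00042",
    question="What is the battery level?",
    full_answer="The battery level is 80%.",
    short_answer="80%",
    ui_contents=[UIContent(text="80%", bbox=(900, 40, 960, 80))],
)
```

Splitting the answer into a full sentence, a short phrase, and grounded UI elements is what lets one annotation serve the several subtasks the paper describes (answering, localization, and transfer evaluation).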
Problem

Research questions and friction points this paper is trying to address.

Advance screen content understanding
Benchmark screen reading comprehension
Enable vision-based automation over screenshots
Innovation

Methods, ideas, or system contributions that make the work stand out.

Annotates 86k QA pairs
Enables four subtasks
Evaluates in diverse settings