CommonForms: A Large, Diverse Dataset for Form Field Detection

📅 2025-09-19
🤖 AI Summary
Existing form field detection datasets suffer from limited scale and diversity, hindering robust model development. Method: The paper introduces CommonForms, a large-scale, open-source dataset for form field detection, constructed by filtering Common Crawl for PDFs with fillable elements, yielding roughly 55K documents and over 450K annotated pages spanning many languages and domains. Form field detection is cast as an object detection task: given a page image, predict the location and type of each field (Text Input, Choice Button, Signature). The pipeline uses high-resolution image inputs and a data-cleaning procedure for the crawled PDFs, and two detectors, FFDNet-Small and FFDNet-Large, are trained on the result. Contribution/Results: The models attain very high average precision on the CommonForms test set at a training cost of under $500 each, and a qualitative analysis shows they outperform a popular commercial PDF tool for form preparation. Unlike most commercial solutions, they can also predict checkboxes in addition to text and signature fields.

📝 Abstract
This paper introduces CommonForms, a web-scale dataset for form field detection. It casts the problem of form field detection as object detection: given an image of a page, predict the location and type (Text Input, Choice Button, Signature) of form fields. The dataset is constructed by filtering Common Crawl to find PDFs that have fillable elements. Starting with 8 million documents, the filtering process is used to arrive at a final dataset of roughly 55k documents that have over 450k pages. Analysis shows that the dataset contains a diverse mixture of languages and domains: one third of the pages are non-English, and among the 14 classified domains, no domain makes up more than 25% of the dataset. In addition, this paper presents a family of form field detectors, FFDNet-Small and FFDNet-Large, which attain a very high average precision on the CommonForms test set. Each model cost less than $500 to train. Ablation results show that high-resolution inputs are crucial for high-quality form field detection, and that the cleaning process improves data efficiency over using all PDFs that have fillable fields in Common Crawl. A qualitative analysis shows that they outperform a popular, commercially available PDF reader that can prepare forms. Unlike the most popular commercially available solutions, FFDNet can predict checkboxes in addition to text and signature fields. This is, to our knowledge, the first large-scale dataset released for form field detection, as well as the first open-source models. The dataset, models, and code will be released at https://github.com/jbarrow/commonforms.
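The task formulation in the abstract (predict a bounding box and one of three field types per detection, evaluated with IoU-based average precision) can be sketched in plain Python. The `FieldDetection` container and `match_fields` helper below are hypothetical names, not part of the released code; they illustrate how predictions would be matched to ground-truth fields at an IoU threshold, the core step behind an AP metric.

```python
from dataclasses import dataclass

# The three field classes named in the paper.
FIELD_TYPES = ("TextInput", "ChoiceButton", "Signature")

@dataclass
class FieldDetection:
    # Hypothetical container; box is (x0, y0, x1, y1) in page-pixel coordinates.
    box: tuple
    field_type: str
    score: float = 1.0

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match_fields(preds, gts, iou_thresh=0.5):
    """Greedily match predictions (highest score first) to ground-truth
    fields of the same class with IoU >= iou_thresh, each GT used once."""
    matched, used = [], set()
    for p in sorted(preds, key=lambda d: -d.score):
        for i, g in enumerate(gts):
            if i in used or g.field_type != p.field_type:
                continue
            if iou(p.box, g.box) >= iou_thresh:
                matched.append((p, g))
                used.add(i)
                break
    return matched
```

A matched pair counts as a true positive; unmatched predictions and ground-truth boxes become false positives and false negatives, which is what a mean-AP computation aggregates over score thresholds and classes.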
Problem

Research questions and friction points this paper is trying to address.

Detecting form field locations and types from page images
Creating a large diverse dataset for form field detection
Developing efficient models for accurate form field identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Form field detection as object detection problem
FFDNet models trained for under $500 each
High-resolution inputs crucial for detection quality