FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Form-filling automation has long been hindered by the poor generalizability of rule-based systems, while current multimodal large language models (MLLMs) suffer from critical deficiencies in GUI layout understanding and precise instruction-to-field alignment. To address this, we introduce FormFactory, the first interactive, multimodal benchmark for form filling, comprising 12 real-world websites, diverse field types, and high-fidelity user interaction traces. We formally define the task and propose an end-to-end evaluation framework that integrates web interface simulation, structured semantic annotation, GUI action modeling, and automated assessment. Experiments reveal that state-of-the-art MLLMs achieve less than 5% accuracy. We systematically identify three dominant failure modes: visual localization errors, field-type misclassification, and instruction-field mismatch, highlighting fundamental bottlenecks in layout reasoning and cross-modal alignment.

📝 Abstract
Online form filling is a common yet labor-intensive task involving extensive keyboard and mouse interactions. Despite the long-standing vision of automating this process with "one click", existing tools remain largely rule-based and lack generalizable, generative capabilities. Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising agents for GUI-related tasks in general-purpose scenarios. However, they struggle with the unique challenges of form filling, such as flexible layouts and the difficulty of aligning textual instructions with on-screen fields. To bridge this gap, we formally define the form-filling task and propose FormFactory, an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and carefully constructed dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions. We conduct a comprehensive evaluation of state-of-the-art MLLMs and observe that no model surpasses 5% accuracy, underscoring the inherent difficulty of the task. These findings also reveal significant limitations in current models' visual layout reasoning and field-value alignment abilities. We hope our benchmark can serve as a stepping stone for further research into robust, practical form-filling agents.
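
The automated assessment the abstract describes can be illustrated with a minimal sketch: score an agent by comparing its predicted field-value assignments against gold annotations, at both the per-field and strict whole-form level. The function name, dict layout, and metric names below are illustrative assumptions, not FormFactory's actual evaluation API.

```python
# Minimal sketch of form-filling evaluation, assuming gold annotations and
# agent predictions are both available as {field_name: value} dicts.
# Names here are hypothetical, not FormFactory's real evaluation module.

def evaluate_form(gold: dict[str, str], predicted: dict[str, str]) -> dict[str, float]:
    """Score one form: per-field accuracy and strict whole-form accuracy."""
    total = len(gold)
    # A field counts as correct only if the agent filled it with the exact gold value.
    correct = sum(1 for field, value in gold.items()
                  if predicted.get(field) == value)
    return {
        "field_accuracy": correct / total if total else 0.0,
        # Strict metric: every field must match for the form to count as solved.
        "form_accuracy": 1.0 if total and correct == total else 0.0,
    }

gold = {"name": "Ada Lovelace", "email": "ada@example.com", "country": "UK"}
pred = {"name": "Ada Lovelace", "email": "ada@example.org", "country": "UK"}
print(evaluate_form(gold, pred))  # field_accuracy = 2/3, form_accuracy = 0.0
```

Under a strict metric like this, a single mis-filled field zeroes out the whole form, which is consistent with the very low (<5%) accuracies the paper reports for current MLLMs.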
Problem

Research questions and friction points this paper is trying to address.

Automating labor-intensive online form-filling tasks
Aligning textual instructions with on-screen form fields
Improving visual layout reasoning in multimodal large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive benchmarking suite for form-filling agents
Multimodal Large Language Models for GUI tasks
High-fidelity form interaction simulation