🤖 AI Summary
Form-filling automation has long been hindered by the poor generalizability of rule-based systems, while current multimodal large language models (MLLMs) suffer from critical deficiencies in GUI layout understanding and precise instruction-to-field alignment. To address this, we introduce FormFactory—the first interactive, multimodal benchmark for form filling—comprising 12 real-world websites, diverse field types, and high-fidelity user interaction traces. We formally define the task and propose an end-to-end evaluation framework integrating web interface simulation, structured semantic annotation, GUI action modeling, and automated assessment. Experiments reveal that state-of-the-art MLLMs achieve less than 5% accuracy. We systematically identify three dominant failure modes—visual localization errors, field-type misclassification, and instruction-field mismatch—highlighting fundamental bottlenecks in layout reasoning and cross-modal alignment.
📝 Abstract
Online form filling is a common yet labor-intensive task involving extensive keyboard and mouse interactions. Despite the long-standing vision of automating this process with "one click", existing tools remain largely rule-based and lack generalizable, generative capabilities. Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising agents for GUI-related tasks in general-purpose scenarios. However, they struggle with the unique challenges of form filling, such as flexible layouts and the difficulty of aligning textual instructions with on-screen fields. To bridge this gap, we formally define the form-filling task and propose FormFactory, an interactive benchmarking suite comprising a web-based interface, a backend evaluation module, and a carefully constructed dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions. We conduct a comprehensive evaluation of state-of-the-art MLLMs and observe that no model surpasses 5% accuracy, underscoring the inherent difficulty of the task. These findings also reveal significant limitations in current models' visual layout reasoning and field-value alignment abilities. We hope our benchmark can serve as a stepping stone for further research into robust, practical form-filling agents.