Finetuning LLMs for Automatic Form Interaction on Web-Browser in Selenium Testing Framework

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) lack systematic evaluation benchmarks and high-quality datasets for generating Selenium-based form interaction test scripts. Method: This paper introduces FormBench, the first dedicated LLM benchmark for web form automation testing. We construct a hybrid dataset comprising synthetically generated and human-annotated examples covering diverse real-world scenarios, and define three core evaluation metrics: syntactic correctness, script executability, and field coverage. Leveraging this dataset, we train a specialized form interaction generation model via supervised fine-tuning to produce end-to-end executable Selenium test scripts. Contribution/Results: Experimental results demonstrate that our approach significantly outperforms strong baselines—including GPT-4o—across all metrics: 100% script executability, syntactic error rate below 0.5%, and average field coverage improved by 32.7%. FormBench thus establishes a foundational resource for advancing LLM-driven automated web testing.

📝 Abstract
Automated web application testing is a critical component of modern software development, with frameworks like Selenium widely adopted for validating functionality through browser automation. Among the essential aspects of such testing is the ability to interact with and validate web forms, a task that requires syntactically correct, executable scripts with high coverage of input fields. Despite its importance, this task remains underexplored in the context of large language models (LLMs), and no public benchmark or dataset exists to evaluate LLMs on form interaction generation systematically. This paper introduces a novel method for training LLMs to generate high-quality test cases in Selenium, specifically targeting form interaction testing. We curate both synthetic and human-annotated datasets for training and evaluation, covering diverse real-world forms and testing scenarios. We define clear metrics for syntax correctness, script executability, and input field coverage. Our empirical study demonstrates that our approach significantly outperforms strong baselines, including GPT-4o and other popular LLMs, across all evaluation metrics. Our work lays the groundwork for future research on LLM-based web testing and provides resources to support ongoing progress in this area.
Problem

Research questions and friction points this paper is trying to address.

Automating web form interaction testing using LLMs in Selenium framework
Addressing lack of benchmarks for evaluating LLMs on form interaction generation
Generating executable test scripts with high syntax correctness and field coverage
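Of the three metrics, field coverage is the least standard, so a concrete reading helps: the fraction of a form's fillable fields that the generated script actually interacts with. The sketch below implements that reading with Python's stdlib `html.parser`; the form markup, field names, and the exact coverage definition are illustrative assumptions, not taken from the paper.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect names of fillable controls (<input>, <select>, <textarea>) in a form."""

    def __init__(self):
        super().__init__()
        self.fields = set()

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            attrs = dict(attrs)
            # Skip controls a test script does not "fill" directly.
            if attrs.get("type") in ("submit", "hidden"):
                return
            if "name" in attrs:
                self.fields.add(attrs["name"])


def field_coverage(form_html: str, filled: set) -> float:
    """Fraction of the form's fillable fields touched by the generated script."""
    collector = FormFieldCollector()
    collector.feed(form_html)
    if not collector.fields:
        return 1.0
    return len(collector.fields & filled) / len(collector.fields)


# Hypothetical registration form and the fields a generated script interacted with.
form = """
<form>
  <input type="text" name="username">
  <input type="email" name="email">
  <select name="country"></select>
  <textarea name="bio"></textarea>
  <input type="submit" name="go">
</form>
"""
print(field_coverage(form, {"username", "email", "country"}))  # 3 of 4 fields -> 0.75
```

A coverage score like this is easy to compute per test case and average across a benchmark, which is presumably how an aggregate number such as the reported 32.7% improvement would be derived.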
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning LLMs for Selenium form interaction testing
Using synthetic and human-annotated datasets for training
Defining metrics for syntax, executability, and coverage
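To make the target output concrete: an end-to-end executable Selenium script of the kind the fine-tuned model emits might look like the string below. This is a minimal sketch with a hypothetical URL and locators, not an example from the paper's dataset; the syntax metric is approximated here by parsing the script with `ast.parse`, which needs no browser.

```python
import ast

# A hypothetical generated test script, kept as text so it can be checked
# without launching a browser. URL and locators are illustrative.
generated_script = '''
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/signup")
driver.find_element(By.NAME, "username").send_keys("testuser")
driver.find_element(By.NAME, "email").send_keys("test@example.com")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
driver.quit()
'''


def is_syntactically_correct(script: str) -> bool:
    """Proxy for a syntax-correctness metric: does the script parse as Python?"""
    try:
        ast.parse(script)
        return True
    except SyntaxError:
        return False


print(is_syntactically_correct(generated_script))  # True
```

Executability, by contrast, requires actually running the script against a live or mocked browser session, which is why the two metrics are reported separately.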
🔎 Similar Papers
2024-05-16 · ACM Transactions on Software Engineering and Methodology · Citations: 4
Nguyen-Khang Le
Japan Advanced Institute of Science and Technology
Deep Learning
Hiep Nguyen
Japan Advanced Institute of Science and Technology, Ishikawa, Japan
Minh Ngoc Nguyen
Japan Advanced Institute of Science and Technology, Ishikawa, Japan
Son T. Luu
University of Information Technology, VNU-HCM
Natural Language Processing · Data Science · Knowledge Representation · Artificial Intelligence
Trung Vo
Japan Advanced Institute of Science and Technology, Ishikawa, Japan
Quan Minh Bui
Amifiable Inc., Tokyo, Japan
Shoshin Nomura
Amifiable Inc., Tokyo, Japan
Le-Minh Nguyen
Japan Advanced Institute of Science and Technology, Ishikawa, Japan