🤖 AI Summary
This study addresses the high cost of manually authoring and maintaining page objects in web end-to-end testing, a challenge exacerbated by the limited practical adoption of existing automation approaches. It presents the first systematic evaluation of large language models—specifically GPT-4o and DeepSeek Coder—for automatically generating page objects, using webpage structure and a benchmark with manually written ground-truth page objects to guide and assess generation. Experimental results show that the generated page objects are syntactically correct and functionally usable, with accuracy between 32.6% and 54.0% and element recognition rates exceeding 70% in most cases. These findings indicate that large language models can help improve test maintainability and scalability, and the study offers empirical evidence and practical guidance for integrating LLMs into real-world testing workflows.
📝 Abstract
Page Objects (POs) are a widely adopted design pattern for improving the maintainability and scalability of automated end-to-end web tests. However, creating and maintaining POs is still largely a manual, labor-intensive activity, while automated solutions have seen limited practical adoption. In this context, the potential of Large Language Models (LLMs) for these tasks has remained largely unexplored. This paper presents an empirical study on the feasibility of using LLMs, specifically GPT-4o and DeepSeek Coder, to automatically generate POs for web testing. We evaluate the generated artifacts on an existing benchmark of five web applications for which manually written POs are available (the ground truth), focusing on accuracy (i.e., the proportion of ground truth elements correctly identified) and element recognition rate (i.e., the proportion of ground truth elements correctly identified or marked for modification). Our results show that LLMs can generate syntactically correct and functionally useful POs, with accuracy values ranging from 32.6% to 54.0% and element recognition rates exceeding 70% in most cases. Our study contributes the first systematic evaluation of LLMs' strengths and open challenges for automated PO generation, and provides directions for further research on integrating LLMs into practical testing workflows.
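To make the two metrics concrete, here is a minimal sketch of how a generated PO could be scored against a ground-truth element list. All names and data are hypothetical illustrations, not the study's actual evaluation code:

```python
# Hedged sketch: scoring a generated Page Object against ground truth.
# accuracy          = |ground-truth elements correctly identified| / |ground truth|
# recognition rate  = |correctly identified OR marked for modification| / |ground truth|
# Element names below are invented for illustration only.

def evaluate_po(ground_truth, identified, marked_for_modification):
    gt = set(ground_truth)
    correct = gt & set(identified)                       # exact matches
    recognized = correct | (gt & set(marked_for_modification))
    return len(correct) / len(gt), len(recognized) / len(gt)

# Hypothetical PO with 5 ground-truth elements: 2 matched exactly,
# 2 more flagged by the LLM as needing modification.
acc, rec = evaluate_po(
    ground_truth=["login_btn", "user_field", "pwd_field", "submit", "logo"],
    identified=["login_btn", "user_field"],
    marked_for_modification=["submit", "logo"],
)
print(acc, rec)  # 0.4 0.8
```

On this toy input, accuracy is 40% while the recognition rate is 80%, mirroring how the paper's recognition rates can exceed its accuracy figures.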