You Don't Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation

📅 2025-08-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation methods—such as static analysis or binary script execution—fail to capture the real-world usability of LLM-generated GUI applications, as they cannot model interactive behavior or runtime dynamics. To address this, we propose RealDevWorld, the first end-to-end automated evaluation framework for production-grade GUI software. It comprises AppEvalPilot, an agent-based review system, and RealDevBench, a multi-domain task suite. The framework integrates multimodal task design, agent-driven user interaction simulation, runtime state monitoring, and visual comparison to jointly assess functional correctness, UI fidelity, and behavioral consistency. Experiments demonstrate an evaluation accuracy of 0.92 and a correlation of 0.85 with human expert judgments, significantly reducing manual effort. RealDevWorld enables scalable, fine-grained, and human-aligned assessment of GUI code generation capabilities.

📝 Abstract
Large Language Models (LLMs) and code agents in software development are rapidly evolving from generating isolated code snippets to producing full-fledged software applications with graphical interfaces, interactive logic, and dynamic behaviors. However, current benchmarks fall short in evaluating such production-ready software, as they often rely on static checks or binary pass/fail scripts, failing to capture the interactive behaviors and runtime dynamics that define real-world usability, qualities that only emerge when an application is actively used. This is the blind spot of current evaluation: you don't know whether an app works until you click through it, interact with it, and observe how it responds. To bridge this gap, we introduce RealDevWorld, a novel evaluation framework for automated end-to-end assessment of LLMs' ability to generate production-ready repositories from scratch. It features two key components: (1) RealDevBench, a diverse collection of 194 open-ended software engineering tasks across multiple domains, incorporating multimodal elements to reflect real-world complexity; and (2) AppEvalPilot, a new agent-as-a-judge evaluation system that simulates realistic, GUI-based user interactions to automatically and holistically assess software functional correctness, visual fidelity, and runtime behavior. The framework delivers fine-grained, task-specific diagnostic feedback, supporting nuanced evaluation beyond simple success/failure judgments. Empirical results show that RealDevWorld delivers effective, automatic, and human-aligned evaluations, achieving an accuracy of 0.92 and a correlation of 0.85 with expert human assessments, while significantly reducing the reliance on manual review. This enables scalable, human-aligned assessment of production-level software generated by LLMs. Our code is available on GitHub.
Problem

Research questions and friction points this paper is trying to address.

Evaluating interactive software generated by LLMs
Assessing GUI functionality and runtime behavior automatically
Bridging the gap in production-ready software evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated GUI testing framework
Agent-as-a-judge evaluation system
Multimodal software engineering tasks
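The agent-as-a-judge idea listed above can be pictured as a loop that replays GUI actions against a running application and grades the resulting state. The following is a minimal illustrative sketch only; the names (`TestCase`, `FakeGuiApp`, `judge`) are hypothetical and do not reflect the paper's actual AppEvalPilot interface.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    description: str
    actions: list                   # GUI actions to replay, e.g. ("click", "#increment")
    check: Callable[[dict], bool]   # verdict on the final observed app state

class FakeGuiApp:
    """Stand-in for a launched application under test."""
    def __init__(self):
        self.state = {"counter": 0}

    def perform(self, action):
        kind, target = action
        if kind == "click" and target == "#increment":
            self.state["counter"] += 1
        return dict(self.state)     # snapshot of state after the interaction

def judge(app_factory, cases):
    """Run each test case against a fresh app instance; return verdicts."""
    verdicts = []
    for case in cases:
        app = app_factory()
        state = dict(app.state)
        for action in case.actions:
            state = app.perform(action)   # simulate one user interaction
        verdicts.append((case.description, case.check(state)))
    return verdicts

cases = [
    TestCase("increment once", [("click", "#increment")],
             lambda s: s["counter"] == 1),
    TestCase("increment twice", [("click", "#increment")] * 2,
             lambda s: s["counter"] == 2),
]
results = judge(FakeGuiApp, cases)
pass_rate = sum(ok for _, ok in results) / len(results)
```

In a real system the simulated app would be a live process driven through the GUI, the state snapshot would be a screenshot plus runtime observations, and the checks would be issued by an LLM agent rather than hand-written lambdas; the loop structure, however, stays the same.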