Toward Autonomous UI Exploration: The UIExplorer Benchmark

📅 2025-06-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of standardized evaluation for autonomous agents' UI exploration capabilities by introducing UIExplore-Bench, the first benchmark dedicated to this task. It features a three-level GitLab sandbox environment supporting dual-modality assessment: Structured (DOM-based) and Screen (GUI screenshot-based). A novel metric, human-normalized UI-Functionalities Observed (hUFO), quantifies how many actionable UI components an agent discovers relative to a human expert. Evaluation integrates DOM parsing, human-like interaction simulation, and functional-coverage analysis to systematically compare agent performance against human experts. Experiments show that UIExplore-AlGo reaches 77.2% and 59.0% of human performance within 2,000 steps in Structured and Screen modes, respectively, significantly outperforming baseline methods, especially under sparse-feedback conditions. All code, datasets, and environments are publicly released, establishing reproducible infrastructure for UI exploration research.

📝 Abstract
Autonomous agents must know how to explore user interfaces (UIs) for reliable task solving, yet systematic evaluation of this crucial phase is lacking. We introduce UIExplore-Bench, the first benchmark explicitly dedicated to UI exploration. The benchmark evaluates agents with either Structured mode (granting access to layout information like DOM trees) or Screen mode (relying on GUI-only observations such as screenshots and human-like mouse/keyboard interactions) across three levels in a standardized GitLab sandbox environment. We formalize exploration as the process of maximizing the set of actionable UI components discovered and propose a metric, human-normalized UI-Functionalities Observed (hUFO), to quantify the effectiveness of exploration. Our results show that UIExplore-AlGo achieves the leading mean hUFO scores, reaching up to 77.2% of human performance in Structured mode and 59.0% in Screen mode at 2,000 steps, particularly excelling at the Sparse level. The results highlight the relevance of our benchmark, as current agents show a substantial performance gap compared to one hour of human expert exploration, indicating ample room for future advancements. We publicly release the benchmark environment, an exploration dataset, and an evaluation suite to catalyze research into efficient UI exploration strategies and their downstream applications, such as experience-driven task completion and automated training data generation.
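The abstract defines hUFO as the set of actionable UI components an agent discovers, normalized against a human-expert reference. The paper's exact formula is not reproduced here, so the following is only an illustrative sketch under that reading; the component identifiers and function name are hypothetical.

```python
# Illustrative sketch of the hUFO idea (not the paper's exact formula):
# exploration effectiveness as the fraction of actionable UI components
# an agent discovers, normalized by a human-expert reference set.

def hufo(agent_discovered: set[str], human_discovered: set[str]) -> float:
    """human-normalized UI-Functionalities Observed (hypothetical form).

    Both arguments are sets of identifiers for actionable UI components
    (e.g. DOM element paths) found within the step/time budget.
    """
    if not human_discovered:
        raise ValueError("human reference set must be non-empty")
    # Count only components that the human reference run also surfaced.
    overlap = agent_discovered & human_discovered
    return len(overlap) / len(human_discovered)

# Example: agent finds 3 of the 4 human-discovered components -> 0.75
score = hufo(
    {"btn:new-issue", "menu:settings", "tab:pipelines"},
    {"btn:new-issue", "menu:settings", "tab:pipelines", "btn:fork"},
)
```

Under this reading, a score of 1.0 means the agent matched one hour of human expert exploration, which is consistent with the reported 77.2% and 59.0% figures being fractions of human performance.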
Problem

Research questions and friction points this paper is trying to address.

Systematic evaluation of autonomous agents' UI exploration is lacking
No established metric quantifies exploration effectiveness relative to human experts
Current agents fall well short of one hour of human expert exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces UIExplore-Bench, the first benchmark dedicated to UI exploration evaluation
Assesses agents in both Structured (DOM-based) and Screen (screenshot-only) modes
Proposes the hUFO metric to quantify exploration effectiveness against human performance