🤖 AI Summary
Existing LLM-driven automation methods show significant limitations in execution consistency, precise identification of critical DOM elements, and evaluability when applied to poorly designed, structurally irregular web interfaces within enterprise intranets. To address these challenges, this paper proposes Cybernaut, a framework with three components: (1) a Standard Operating Procedure (SOP) generation mechanism that transforms user demonstrations into robust, executable instructions; (2) a high-precision HTML element localization model integrating both semantic and structural features; and (3) a behavior-trajectory-based quantitative evaluation framework for measuring execution consistency. Evaluated on an internal benchmark, the approach raises task success rate from 72.0% to 88.68% over browser_use and identifies consistent execution patterns with 84.7% accuracy. These advances improve the stability, interpretability, and assessability of AI agents in real-world industrial environments.
📝 Abstract
The emergence of AI-driven web automation through Large Language Models (LLMs) offers unprecedented opportunities for optimizing digital workflows. However, deploying such systems in real-world industry environments presents four core challenges: (1) ensuring consistent execution, (2) accurately identifying critical HTML elements, (3) achieving human-like accuracy so that operations can be automated at scale, and (4) addressing the lack of comprehensive benchmarking data for internal web applications. Existing solutions are primarily tailored to well-designed, consumer-facing websites (e.g., Amazon.com, Apple.com) and fall short when faced with the complexity of poorly designed internal web interfaces. To address these limitations, we present Cybernaut, a novel framework that ensures high execution consistency in web automation agents designed for robust enterprise use. Our contributions are threefold: (1) a Standard Operating Procedure (SOP) generator that converts user demonstrations into reliable automation instructions for linear browsing tasks, (2) a high-precision HTML DOM element recognition system tailored to complex web interfaces, and (3) a quantitative metric to assess execution consistency. Empirical evaluation on our internal benchmark shows that our framework yields a 23.2% relative improvement in task execution success rate (from 72% to 88.68%, a gain of 16.68 percentage points) over browser_use. Cybernaut identifies consistent execution patterns with 84.7% accuracy, enabling reliable confidence assessment and adaptive guidance during task execution in real-world systems. These results highlight Cybernaut's effectiveness in enterprise-scale web automation and lay a foundation for future advances in the field.