🤖 AI Summary
Existing web agents are confined to the human interface layer, lacking direct access to application logic, which limits their robustness and action expressiveness. This work proposes a lightweight embedded agent architecture that enables coordinated control over both frontend and backend components through frontend hooks and reusable backend workflows. The architecture supports unified integration across diverse frontend frameworks such as React and Angular, and combines ARIA and URL observation, a per-page function registry, WebSocket communication, and MCP tool invocation to enable mixed-granularity actions and multi-step task execution. Evaluated in real-world web environments, the approach demonstrates that stable, complex agent behaviors can be deployed with minimal modification overhead, supporting its generality and practical utility.
📝 Abstract
Most web agents operate at the human interface level, observing screenshots or raw DOM trees without application-level access, which limits robustness and action expressiveness. In enterprise settings, however, explicit control of both the frontend and backend is available. We present EmbeWebAgent, a framework for embedding agents directly into existing UIs using lightweight frontend hooks (curated ARIA and URL-based observations, and a per-page function registry exposed via a WebSocket) and a reusable backend workflow that performs reasoning and takes actions. EmbeWebAgent is stack-agnostic (e.g., React or Angular), supports mixed-granularity actions ranging from GUI primitives to higher-level composites, and orchestrates navigation, manipulation, and domain-specific analytics via MCP tools. Our demo shows minimal retrofitting effort and robust multi-step behaviors grounded in a live UI setting. Live Demo: https://youtu.be/Cy06Ljee1JQ
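The per-page function registry exposed over a WebSocket can be pictured as a small dispatch layer: the page registers named functions, and agent messages of the form `{ fn, args }` are routed to them. The sketch below is a hypothetical illustration of that idea (the names, message shape, and `setDateRange` composite are assumptions, not EmbeWebAgent's actual API):

```typescript
// Hypothetical sketch of a per-page function registry with message dispatch,
// as a frontend hook might expose it to an agent over a WebSocket.
// All identifiers here are illustrative assumptions.

type PageFn = (...args: unknown[]) => unknown;

const registry = new Map<string, PageFn>();

// The page (React, Angular, etc.) registers functions the agent may call.
function registerPageFn(name: string, fn: PageFn): void {
  registry.set(name, fn);
}

// Handle one agent message, as it would arrive over the WebSocket.
function dispatch(message: string): unknown {
  const { fn, args } = JSON.parse(message) as { fn: string; args: unknown[] };
  const target = registry.get(fn);
  if (!target) throw new Error(`unknown page function: ${fn}`);
  return target(...args);
}

// Example: a higher-level composite action the agent can invoke directly,
// instead of replaying low-level GUI clicks.
registerPageFn("setDateRange", (start, end) => `range set: ${start}..${end}`);

dispatch(JSON.stringify({ fn: "setDateRange", args: ["2024-01-01", "2024-01-31"] }));
// → "range set: 2024-01-01..2024-01-31"
```

In a real deployment the `dispatch` call would sit inside a WebSocket `onmessage` handler, letting the agent mix these registered composites with ordinary GUI primitives.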