TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents

📅 2025-02-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing GUI agents rely on dataset-specific training, while general-purpose large vision-language models (LVLMs) struggle with cross-platform action grounding: their Set-of-Marks (SoM) labels require metadata such as HTML source, which is not consistently available across platforms, limiting generalization. TRISHUL addresses this with a training-free agentic framework that, for the first time, unifies action grounding and GUI referring. It employs Hierarchical Screen Parsing (HSP) for screen-level structural understanding and Spatially Enhanced Element Description (SEED) for multi-granular, spatially and semantically enriched element representations. Crucially, TRISHUL operates solely atop off-the-shelf LVLMs (e.g., GPT-4V), requiring no UI source code, fine-tuning, or task-specific training. Experiments demonstrate state-of-the-art performance across major action grounding benchmarks, including ScreenSpot, VisualWebBench, AITW, and Mind2Web, and TRISHUL surpasses the ToL agent on the ScreenPR referring benchmark.

📝 Abstract
Recent advancements in Large Vision Language Models (LVLMs) have enabled the development of LVLM-based Graphical User Interface (GUI) agents under various paradigms. Training-based approaches, such as CogAgent and SeeClick, struggle with cross-dataset and cross-platform generalization due to their reliance on dataset-specific training. Generalist LVLMs, such as GPT-4V, employ Set-of-Marks (SoM) for action grounding, but obtaining SoM labels requires metadata like HTML source, which is not consistently available across platforms. Moreover, existing methods often specialize in singular GUI tasks rather than achieving comprehensive GUI understanding. To address these limitations, we introduce TRISHUL, a novel, training-free agentic framework that enhances generalist LVLMs for holistic GUI comprehension. Unlike prior works that focus on either action grounding (mapping instructions to GUI elements) or GUI referring (describing GUI elements given a location), TRISHUL seamlessly integrates both. At its core, TRISHUL employs Hierarchical Screen Parsing (HSP) and the Spatially Enhanced Element Description (SEED) module, which work synergistically to provide multi-granular, spatially, and semantically enriched representations of GUI elements. Our results demonstrate TRISHUL's superior performance in action grounding across the ScreenSpot, VisualWebBench, AITW, and Mind2Web datasets. Additionally, for GUI referring, TRISHUL surpasses the ToL agent on the ScreenPR benchmark, setting a new standard for robust and adaptable GUI comprehension.
Problem

Research questions and friction points this paper is trying to address.

Limited holistic GUI comprehension in generalist LVLMs
Fragmented treatment of action grounding versus GUI referring
Poor cross-dataset and cross-platform generalization of trained GUI agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free agentic framework for GUI
Hierarchical Screen Parsing (HSP) technique
Spatially Enhanced Element Description (SEED)
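The pieces above can be sketched as a single training-free flow: parse the screen into a hierarchy (HSP), describe each element with joint spatial and semantic information (SEED), and hand the result to an off-the-shelf LVLM as part of its prompt. The sketch below is purely illustrative and assumes a hypothetical region tree and prompt format; the names `hsp_parse`, `seed_describe`, and `build_grounding_prompt` are not from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Element:
    label: str                     # semantic description, e.g. "Search button"
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in screen pixels
    children: List["Element"] = field(default_factory=list)

def hsp_parse(detected: List[Element]) -> Element:
    """Hierarchical Screen Parsing (sketch): group detected elements
    under a screen-level root to expose screen structure."""
    return Element("screen", (0, 0, 1920, 1080), children=detected)

def seed_describe(el: Element, depth: int = 0) -> List[str]:
    """Spatially Enhanced Element Description (sketch): one line per
    element combining semantics (label) with spatial info (bbox)."""
    lines = [f"{'  ' * depth}* {el.label} at {el.bbox}"]
    for child in el.children:
        lines.extend(seed_describe(child, depth + 1))
    return lines

def build_grounding_prompt(instruction: str, root: Element) -> str:
    """Assemble the text an off-the-shelf LVLM would receive alongside
    the screenshot; no HTML metadata or fine-tuning is assumed."""
    description = "\n".join(seed_describe(root))
    return (f"Instruction: {instruction}\n"
            f"Screen hierarchy:\n{description}\n"
            "Answer with the element to act on.")

elements = [
    Element("Search button", (1700, 20, 1800, 60)),
    Element("Nav bar", (0, 0, 1920, 80),
            children=[Element("Home link", (20, 20, 100, 60))]),
]
prompt = build_grounding_prompt("Open the home page", hsp_parse(elements))
print(prompt)
```

The key property this illustrates is that every input is derived from pixels alone (detected elements and their boxes), which is why the approach sidesteps the missing-HTML problem that limits SoM-based grounding.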
Kunal Singh
Fractal AI Research, India
Shreyas Singh
Indian Institute of Technology Madras
Computer Vision · Deep Learning · Computational Imaging
Mukund Khanna
Fractal AI Research, India