TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents

📅 2025-02-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing GUI agents rely on dataset-specific training, while general-purpose large vision-language models (LVLMs) struggle with cross-platform action grounding: their Set-of-Marks (SoM) labels require metadata such as HTML source, which is not consistently available across platforms, limiting generalization. TRISHUL addresses this with a training-free agentic framework that, for the first time, unifies action grounding and GUI referring. It employs Hierarchical Screen Parsing (HSP) for screen-level structural understanding and Spatially Enhanced Element Description (SEED) for multi-granular, spatially and semantically enriched element representations. Crucially, TRISHUL operates solely atop off-the-shelf LVLMs (e.g., GPT-4V), requiring no UI source code, fine-tuning, or task-specific training. Experiments demonstrate state-of-the-art performance across major action grounding benchmarks, including ScreenSpot, VisualWebBench, AITW, and Mind2Web, and TRISHUL surpasses the ToL agent on the ScreenPR referring benchmark.

📝 Abstract
Recent advancements in Large Vision Language Models (LVLMs) have enabled the development of LVLM-based Graphical User Interface (GUI) agents under various paradigms. Training-based approaches, such as CogAgent and SeeClick, struggle with cross-dataset and cross-platform generalization due to their reliance on dataset-specific training. Generalist LVLMs, such as GPT-4V, employ Set-of-Marks (SoM) for action grounding, but obtaining SoM labels requires metadata like HTML source, which is not consistently available across platforms. Moreover, existing methods often specialize in singular GUI tasks rather than achieving comprehensive GUI understanding. To address these limitations, we introduce TRISHUL, a novel, training-free agentic framework that enhances generalist LVLMs for holistic GUI comprehension. Unlike prior works that focus on either action grounding (mapping instructions to GUI elements) or GUI referring (describing GUI elements given a location), TRISHUL seamlessly integrates both. At its core, TRISHUL employs Hierarchical Screen Parsing (HSP) and the Spatially Enhanced Element Description (SEED) module, which work synergistically to provide multi-granular, spatially, and semantically enriched representations of GUI elements. Our results demonstrate TRISHUL's superior performance in action grounding across the ScreenSpot, VisualWebBench, AITW, and Mind2Web datasets. Additionally, for GUI referring, TRISHUL surpasses the ToL agent on the ScreenPR benchmark, setting a new standard for robust and adaptable GUI comprehension.
Problem

Research questions and friction points this paper is trying to address.

Limited holistic GUI comprehension in generalist LVLMs
Fragmented treatment of action grounding versus GUI referring
Poor cross-dataset and cross-platform generalization of trained GUI agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free agentic framework for GUI
Hierarchical Screen Parsing (HSP) technique
Spatially Enhanced Element Description (SEED)
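The pieces above can be sketched as a single training-free flow: parse the screen into a hierarchy (HSP), describe each element with joint spatial and semantic information (SEED), and hand the result to an off-the-shelf LVLM as part of its prompt. The sketch below is purely illustrative and assumes a hypothetical region tree and prompt format; the names `hsp_parse`, `seed_describe`, and `build_grounding_prompt` are not from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Element:
    label: str                     # semantic description, e.g. "Search button"
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in screen pixels
    children: List["Element"] = field(default_factory=list)

def hsp_parse(detected: List[Element]) -> Element:
    """Hierarchical Screen Parsing (sketch): group detected elements
    under a screen-level root to expose screen structure."""
    return Element("screen", (0, 0, 1920, 1080), children=detected)

def seed_describe(el: Element, depth: int = 0) -> List[str]:
    """Spatially Enhanced Element Description (sketch): one line per
    element combining semantics (label) with spatial info (bbox)."""
    lines = [f"{'  ' * depth}* {el.label} at {el.bbox}"]
    for child in el.children:
        lines.extend(seed_describe(child, depth + 1))
    return lines

def build_grounding_prompt(instruction: str, root: Element) -> str:
    """Assemble the text an off-the-shelf LVLM would receive alongside
    the screenshot; no HTML metadata or fine-tuning is assumed."""
    description = "\n".join(seed_describe(root))
    return (f"Instruction: {instruction}\n"
            f"Screen hierarchy:\n{description}\n"
            "Answer with the element to act on.")

elements = [
    Element("Search button", (1700, 20, 1800, 60)),
    Element("Nav bar", (0, 0, 1920, 80),
            children=[Element("Home link", (20, 20, 100, 60))]),
]
prompt = build_grounding_prompt("Open the home page", hsp_parse(elements))
print(prompt)
```

The key property this illustrates is that every input is derived from pixels alone (detected elements and their boxes), which is why the approach sidesteps the missing-HTML problem that limits SoM-based grounding.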
Kunal Singh
Fractal AI Research, India
Shreyas Singh
Indian Institute of Technology Madras
Computer Vision · Deep Learning · Computational Imaging
Mukund Khanna
Fractal AI Research, India