🤖 AI Summary
Existing methods for GUI element localization either rely heavily on large amounts of annotated data for fine-tuning, resulting in poor generalization, or suffer from insufficient reliability due to the absence of explicit spatial anchors. This work proposes Trifuse, a novel framework that, for the first time, integrates OCR-extracted text and icon-level semantics as complementary spatial anchors within an attention mechanism. It further introduces a Consensus-SinglePeak fusion strategy to achieve high-precision localization without requiring task-specific fine-tuning. Evaluated across four GUI benchmarks, Trifuse substantially outperforms current zero-shot localization approaches, significantly reducing dependence on costly labeled data while delivering consistent performance gains across diverse model backbones.
📝 Abstract
GUI grounding maps natural language instructions to the correct interface elements, serving as the perception foundation for GUI agents. Existing approaches predominantly rely on fine-tuning multimodal large language models (MLLMs) on large-scale GUI datasets to predict target element coordinates, which is data-intensive and generalizes poorly to unseen interfaces. Recent attention-based alternatives exploit localization signals in MLLMs' attention mechanisms without task-specific fine-tuning, but suffer from low reliability due to the lack of explicit and complementary spatial anchors in GUI images. To address this limitation, we propose Trifuse, an attention-based grounding framework that explicitly integrates complementary spatial anchors: it fuses attention, OCR-derived textual cues, and icon-level caption semantics via a Consensus-SinglePeak (CS) fusion strategy that enforces cross-modal agreement while retaining sharp localization peaks. Extensive evaluations on four grounding benchmarks demonstrate that Trifuse achieves strong performance without task-specific fine-tuning, substantially reducing the reliance on expensive annotated data. Moreover, ablation studies reveal that incorporating OCR and caption cues consistently improves attention-based grounding performance across different backbones, highlighting its effectiveness as a general framework for GUI grounding.