POINTS-GUI-G: GUI-Grounding Journey

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited spatial awareness of foundational vision-language models in GUI element grounding tasks by proposing POINTS-GUI-G-8B, an end-to-end framework for intelligent GUI agent operation. The approach unifies multi-source data formats, employs difficulty-aware data curation and augmentation strategies, and ensures consistent resolution between training and inference. Notably, it introduces reinforcement learning with verifiable rewards to GUI grounding for the first time, significantly enhancing localization accuracy. The method achieves state-of-the-art performance across multiple benchmarks, attaining scores of 59.9, 66.0, 95.7, and 49.9 on ScreenSpot-Pro, OSWorld-G, ScreenSpot-v2, and UI-Vision, respectively.

📝 Abstract
The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model's success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open-source dataset formats alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.
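The abstract notes that GUI grounding suits RL because rewards are easily verifiable. A minimal sketch of what such a verifiable reward could look like (the paper does not specify its exact reward function; the point-in-box rule below is an illustrative assumption):

```python
def grounding_reward(pred_point, gt_box):
    """Hypothetical verifiable reward for GUI grounding:
    1.0 if the model's predicted click point lands inside the
    ground-truth element bounding box, else 0.0.

    pred_point: (x, y) predicted coordinates
    gt_box: (x1, y1, x2, y2) ground-truth element bounds
    """
    x, y = pred_point
    x1, y1, x2, y2 = gt_box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0


# Example: a click at (120, 45) on a button spanning (100, 30)-(200, 60)
reward = grounding_reward((120, 45), (100, 30, 200, 60))  # 1.0
```

Because the check is a pure geometric test against annotated ground truth, the reward signal is exact and cheap to compute, which is what makes RL attractive here compared with tasks that need learned or human-judged reward models.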
Problem

Research questions and friction points this paper is trying to address.

GUI grounding
vision-language models
element localization
task automation
perception-intensive tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI grounding
data engineering
reinforcement learning
vision-language models
resolution consistency