Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

📅 2024-10-24
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Addressing fundamental challenges in cross-platform UI understanding (iPhone, Android, iPad, Webpage, and AppleTV), including platform diversity, resolution variation, and limited training data, this paper introduces Ferret-UI 2, a multimodal large language model (MLLM) designed for universal UI understanding. Building on Ferret-UI, Ferret-UI 2 contributes: (1) support for multiple platform types in a single model; (2) high-resolution perception through adaptive scaling; and (3) advanced task training data generated by GPT-4o with set-of-mark visual prompting. Evaluated on referring, grounding, user-centric advanced tasks (9 subtasks on each of 5 platforms), the GUIDE next-action prediction dataset, and the GUI-World multi-platform benchmark, Ferret-UI 2 significantly outperforms Ferret-UI and shows strong cross-platform transfer.

📝 Abstract
Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation. In this paper, we introduce Ferret-UI 2, a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it highly versatile and adaptable for the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks $\times$ 5 platforms), GUIDE next-action prediction dataset, and GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI, and also shows strong cross-platform transfer capabilities.
Problem

Research questions and friction points this paper is trying to address.

Addresses platform diversity in UI understanding.
Overcomes resolution variation challenges in UI analysis.
Mitigates the shortage of training data for advanced, user-centered UI tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal large language model for UI
High-resolution perception via adaptive scaling
Advanced task data generation with GPT-4o
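
Set-of-mark visual prompting, in general, overlays numbered tags on regions of an image so that a model can refer to each region by its tag. The sketch below illustrates only the bookkeeping side of that idea for a UI screenshot: it numbers elements in reading order and builds the text half of a prompt. It is not the Ferret-UI 2 data pipeline, which the abstract does not detail. The `Element` schema, the reading-order rule, the prompt wording, and all function names are assumptions made for illustration, and the actual drawing of tags onto the screenshot is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A UI element with a pixel bounding box (hypothetical schema)."""
    kind: str
    box: tuple  # (left, top, right, bottom)


def assign_marks(elements):
    """Number elements in reading order: top-to-bottom, then left-to-right."""
    ordered = sorted(elements, key=lambda e: (e.box[1], e.box[0]))
    return {mark: el for mark, el in enumerate(ordered, start=1)}


def mark_anchor(box):
    """Top-left corner of the box, where a renderer could draw the numeric tag."""
    left, top, _, _ = box
    return (left, top)


def build_prompt(marks, task):
    """Compose the text half of the prompt; the image half would carry the drawn tags."""
    lines = [f"[{m}] {el.kind} at {el.box}" for m, el in marks.items()]
    return "Marked elements:\n" + "\n".join(lines) + f"\nTask: {task}"


# Illustrative elements only; these are not taken from any dataset.
elements = [
    Element("button", (200, 600, 320, 650)),
    Element("text field", (40, 120, 360, 170)),
    Element("icon", (20, 20, 60, 60)),
]
marks = assign_marks(elements)
prompt = build_prompt(marks, "Which mark should the user tap to submit the form?")
print(prompt)
```

In a full pipeline, the tagged screenshot and a prompt like this one would be sent together to the model that generates the training data, so that its answers can name elements by mark number instead of by raw coordinates.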