🤖 AI Summary
To address fundamental challenges in cross-platform UI understanding (iPhone, Android, iPad, Web, and Apple TV), including platform heterogeneity, variable screen resolutions, and scarce annotated data, this paper introduces Ferret-UI 2, a multimodal large language model designed for general-purpose UI understanding. Building on Ferret-UI, it contributes: (1) support for multiple platform types in a single model; (2) high-resolution perception through adaptive scaling; and (3) training data generation for advanced tasks, powered by GPT-4o with set-of-mark visual prompting, which yields user-centric, multi-task instruction data. Evaluated on referring, grounding, and user-centric advanced tasks (9 subtasks on each of 5 platforms), the GUIDE next-action prediction dataset, and the GUI-World multi-platform benchmark, Ferret-UI 2 significantly outperforms Ferret-UI and shows strong cross-platform transfer.
📝 Abstract
Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation. In this paper, we introduce Ferret-UI 2, a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it highly versatile and adaptable for the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks $\times$ 5 platforms), GUIDE next-action prediction dataset, and GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI, and also shows strong cross-platform transfer capabilities.