🤖 AI Summary
Problem: Existing GUI agents transfer poorly in dynamic digital environments. They struggle to adapt to interface variations across app versions, platforms (iOS/Android/Web), and applications, which leads to instruction grounding failures. Method: We introduce TransBench, the first systematic benchmark for evaluating GUI agent transferability, formally defining and quantifying transfer capability along three dimensions: version, platform, and application. It comprises a standardized test suite covering 15 mainstream application categories across multiple versions and platforms, and establishes a comprehensive evaluation protocol integrating multi-source GUI data collection, vision-language alignment, interface element abstraction, and cross-domain generalization assessment. Contribution/Results: Experiments with TransBench report significant improvements in grounding accuracy and assess agent performance under frequent UI updates and platform heterogeneity. As the first benchmark to systematically address GUI transferability evaluation, it fills a critical gap in the field.
📝 Abstract
Graphical User Interface (GUI) agents, which autonomously operate on digital interfaces through natural language instructions, hold transformative potential for accessibility, automation, and user experience. A critical aspect of their functionality is grounding: the ability to map linguistic intents to visual and structural interface elements. However, existing GUI agents often struggle to adapt to the dynamic and interconnected nature of real-world digital environments, where tasks frequently span multiple platforms and applications and are also affected by version updates. To address this, we introduce TransBench, the first benchmark designed to systematically evaluate and enhance the transferability of GUI agents across three key dimensions: cross-version transferability (adapting to version updates), cross-platform transferability (generalizing across platforms such as iOS, Android, and Web), and cross-application transferability (handling tasks spanning functionally distinct apps). TransBench includes 15 app categories with diverse functionalities, capturing essential pages across versions and platforms to enable robust evaluation. Our experiments demonstrate significant improvements in grounding accuracy, showcasing the practical utility of GUI agents in dynamic, real-world environments. Our code and data will be publicly available on GitHub.
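To make the notion of grounding accuracy along the three transfer dimensions concrete, here is a minimal Python sketch. It rests on assumptions not stated in the abstract: that a prediction counts as correct when the predicted click point falls inside the target element's bounding box (a common convention in GUI grounding evaluation), and that each sample is tagged with one transfer dimension. The `Sample` type, its field names, and the toy data are hypothetical and do not reflect the actual TransBench data format or scoring protocol.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical grounding sample (not the TransBench schema)."""
    dimension: str          # "version", "platform", or "application"
    bbox: tuple             # target element as (left, top, right, bottom)
    predicted_point: tuple  # agent's predicted click as (x, y)


def is_hit(sample):
    """Assumed criterion: the predicted point lies inside the target box."""
    left, top, right, bottom = sample.bbox
    x, y = sample.predicted_point
    return left <= x <= right and top <= y <= bottom


def accuracy_by_dimension(samples):
    """Return the fraction of hits for each transfer dimension."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample in samples:
        totals[sample.dimension] += 1
        hits[sample.dimension] += int(is_hit(sample))
    return {dim: hits[dim] / totals[dim] for dim in totals}


# Toy data for illustration only; not drawn from the benchmark.
samples = [
    Sample("version", (10, 10, 110, 50), (60, 30)),       # inside the box
    Sample("version", (10, 10, 110, 50), (200, 30)),      # outside the box
    Sample("platform", (0, 0, 40, 40), (20, 20)),         # inside the box
    Sample("application", (100, 100, 150, 130), (10, 10)),  # outside the box
]

scores = accuracy_by_dimension(samples)
```

Reporting a separate score per dimension, rather than a single pooled number, makes it visible whether an agent fails mainly under version updates, platform changes, or application changes.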