🤖 AI Summary
Maintaining design consistency, discovering UI patterns, and verifying regulatory compliance in enterprise-scale software are hindered by the sheer volume of UI screenshots. To address this, we propose a graph-based multimodal UI retrieval framework. Our method explicitly models UI screenshots as attributed graphs, integrating visual, structural, and semantic features across three complementary modalities, and introduces a composable query language for fine-grained, semantics-aware retrieval. We employ a contrastive graph autoencoder coupled with multimodal embedding learning, and deploy a hybrid indexing architecture to enable efficient, real-time querying. Evaluated on a dataset of 20,396 financial software UI screens, our framework achieves a Top-5 retrieval accuracy of 0.92, with median and P95 latencies of 47.5 ms and 124 ms, respectively. This represents a significant improvement in both precision and efficiency for large-scale UI asset management.
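To make the attributed-graph idea concrete, the following is a minimal sketch (the node schema, attribute names, and relation labels here are illustrative assumptions, not the paper's actual data model): each UI component becomes a node carrying visual, structural, and semantic attributes, and edges encode containment hierarchy and spatial arrangement.

```python
from dataclasses import dataclass, field

# Hypothetical schema: nodes are UI components with visual (bbox),
# semantic (role, text), and structural (edge) information.
@dataclass
class UINode:
    node_id: str
    role: str          # semantic type, e.g. "button", "form"
    bbox: tuple        # (x, y, w, h) in pixels
    text: str = ""     # OCR'd or accessibility label

@dataclass
class UIGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, dst, relation)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, dst, relation):
        # relation: "contains" (hierarchy) or "above"/"left-of" (spatial)
        self.edges.append((src, dst, relation))

# A toy login screen expressed as an attributed graph.
g = UIGraph()
g.add_node(UINode("root", "screen", (0, 0, 1280, 800)))
g.add_node(UINode("form", "form", (440, 200, 400, 320)))
g.add_node(UINode("user", "textfield", (460, 240, 360, 48), "Username"))
g.add_node(UINode("login", "button", (460, 400, 360, 48), "Log in"))
g.add_edge("root", "form", "contains")
g.add_edge("form", "user", "contains")
g.add_edge("form", "login", "contains")
g.add_edge("user", "login", "above")
```

A graph encoder (such as the contrastive graph autoencoder described above) would then embed graphs like `g` so that screens with similar structure land near each other in the index.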
📝 Abstract
Enterprise software companies maintain thousands of user interface (UI) screens across products and versions, creating critical challenges for design consistency, pattern discovery, and compliance checking. Existing approaches rely on visual similarity or text semantics and lack explicit modeling of the structural properties fundamental to UI composition. We present a novel graph-based representation that converts UI screenshots into attributed graphs encoding hierarchical relationships and spatial arrangements, potentially generalizable to document layouts, architectural diagrams, and other structured visual domains. A contrastive graph autoencoder learns embeddings that preserve multi-level similarity across visual, structural, and semantic properties. Comprehensive analysis demonstrates that our structural embeddings achieve better discriminative power than state-of-the-art vision encoders, representing a fundamental advance in the expressiveness of UI representations. We implement this representation in UISearch, a multimodal search framework that combines structural embeddings with semantic search through a composable query language. On 20,396 financial software UIs, UISearch achieves 0.92 Top-5 accuracy with 47.5 ms median latency (P95: 124 ms), scaling to 20,000+ screens. The hybrid indexing architecture enables complex queries and supports fine-grained UI distinctions that are impossible with vision-only approaches.
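A composable query language like the one the abstract describes can be pictured as predicate combinators over screen metadata. The combinators and the screen records below are hypothetical stand-ins (UISearch's actual query syntax is not specified here); the sketch only shows how structural and semantic filters compose:

```python
# Hypothetical query combinators: each returns a predicate over a
# screen record, and and_() composes predicates conjunctively.
def has_role(role):
    return lambda screen: any(n["role"] == role for n in screen["nodes"])

def text_contains(term):
    return lambda screen: any(
        term.lower() in n.get("text", "").lower() for n in screen["nodes"]
    )

def and_(*preds):
    return lambda screen: all(p(screen) for p in preds)

# Two toy indexed screens (illustrative data, not from the dataset).
screens = [
    {"id": "s1", "nodes": [{"role": "button", "text": "Transfer funds"}]},
    {"id": "s2", "nodes": [{"role": "chart", "text": "Portfolio"}]},
]

# "Screens containing a button whose label mentions transfers."
query = and_(has_role("button"), text_contains("transfer"))
matches = [s["id"] for s in screens if query(s)]
print(matches)  # ['s1']
```

In a production system such predicates would compile down to the hybrid index (vector similarity plus attribute filters) rather than scanning records linearly as this sketch does.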