UISearch: Graph-Based Embeddings for Multimodal Enterprise UI Screenshots Retrieval

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Maintaining design consistency, discovering UI patterns, and verifying regulatory compliance in enterprise-scale software are hindered by the sheer volume of UI screenshots. To address this, we propose a graph-based multimodal UI retrieval framework. Our method explicitly models UI screenshots as attributed graphs, integrating visual, structural, and semantic features across three complementary modalities, and introduces a composable query language for fine-grained, semantics-aware retrieval. We employ a contrastive graph autoencoder coupled with multimodal embedding learning, and deploy a hybrid indexing architecture to enable efficient, real-time querying. Evaluated on a dataset of 20,396 financial software UI screens, our framework achieves a Top-5 retrieval accuracy of 0.92, with median and P95 latencies of 47.5 ms and 124 ms, respectively. This represents a significant improvement in both precision and efficiency for large-scale UI asset management.

📝 Abstract
Enterprise software companies maintain thousands of user interface screens across products and versions, creating critical challenges for design consistency, pattern discovery, and compliance checking. Existing approaches rely on visual similarity or text semantics, lacking explicit modeling of the structural properties fundamental to user interface (UI) composition. We present a novel graph-based representation that converts UI screenshots into attributed graphs encoding hierarchical relationships and spatial arrangements, potentially generalizable to document layouts, architectural diagrams, and other structured visual domains. A contrastive graph autoencoder learns embeddings that preserve multi-level similarity across visual, structural, and semantic properties. A comprehensive analysis demonstrates that our structural embeddings achieve better discriminative power than state-of-the-art vision encoders, representing a fundamental advance in the expressiveness of UI representations. We implement this representation in UISearch, a multimodal search framework that combines structural embeddings with semantic search through a composable query language. On 20,396 financial software UIs, UISearch achieves 0.92 Top-5 accuracy with 47.5 ms median latency (P95: 124 ms), scaling to 20,000+ screens. The hybrid indexing architecture enables complex queries and supports fine-grained UI distinctions impossible with vision-only approaches.
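The abstract describes combining embedding similarity with a composable query language over structural properties. A minimal sketch of that hybrid retrieval idea, with toy embeddings and a widget-count predicate standing in for the paper's learned embeddings and query language (all screen names, vectors, and predicates below are illustrative assumptions, not from the paper):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy index: each screen has an embedding plus structural metadata.
index = {
    "login_v1": {"emb": [0.9, 0.1, 0.0], "widgets": {"Button": 2, "TextField": 2}},
    "trades":   {"emb": [0.1, 0.8, 0.3], "widgets": {"DataGrid": 1, "Button": 1}},
    "login_v2": {"emb": [0.8, 0.2, 0.1], "widgets": {"Button": 2, "TextField": 2}},
}

def search(query_emb, predicate, k=5):
    """Filter by a structural predicate, then rank by embedding similarity."""
    hits = [(sid, cosine(query_emb, meta["emb"]))
            for sid, meta in index.items() if predicate(meta)]
    return sorted(hits, key=lambda h: -h[1])[:k]

# Query: screens similar to a login form that contain at least two text fields.
results = search([1.0, 0.0, 0.0], lambda m: m["widgets"].get("TextField", 0) >= 2)
print([sid for sid, _ in results])  # ['login_v1', 'login_v2']
```

A production system would replace the linear scan with an approximate nearest-neighbor index to reach the reported sub-50 ms median latencies; the filter-then-rank composition is the point of the sketch.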
Problem

Research questions and friction points this paper is trying to address.

Retrieving enterprise UI screenshots using multimodal structural embeddings
Addressing limitations of visual-only approaches in UI representation
Enabling complex queries for design consistency and compliance checks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based representation converts UI screenshots into attributed graphs
Contrastive graph autoencoder learns multimodal similarity embeddings
Hybrid indexing architecture combines structural and semantic search
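The first innovation, converting a screenshot into an attributed graph, can be sketched with plain data classes. Widget names, types, and coordinates below are invented for illustration; the three node attributes loosely mirror the visual, structural, and semantic modalities the paper mentions:

```python
from dataclasses import dataclass, field

@dataclass
class UINode:
    kind: str        # structural modality: widget type
    bbox: tuple      # visual modality: (x, y, width, height)
    text: str = ""   # semantic modality: visible label

@dataclass
class UIGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (parent, child, relation)

    def add(self, nid, node):
        self.nodes[nid] = node

    def contain(self, parent, child):
        # Containment edges encode the UI hierarchy.
        self.edges.append((parent, child, "contains"))

g = UIGraph()
g.add("root",   UINode("Window",   (0, 0, 800, 600)))
g.add("navbar", UINode("Toolbar",  (0, 0, 800, 48), "File Edit View"))
g.add("grid",   UINode("DataGrid", (0, 48, 800, 520), "Trades"))
g.contain("root", "navbar")
g.contain("root", "grid")
print(len(g.nodes), len(g.edges))  # 3 2
```

In the paper's pipeline, such graphs would then be fed to the contrastive graph autoencoder to produce the retrieval embeddings; this sketch only shows the representation step.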
Maroun Ayli
Center For Computer Science, Saint Joseph University of Beirut, Beirut, Lebanon
Youssef Bakouny
Center For Computer Science, Saint Joseph University of Beirut, Beirut, Lebanon
Tushar Sharma
Asst. Professor, FCS, Dalhousie University (Software engineering, Machine learning for software engineering, Green AI)
Nader Jalloul
Murex, Paris, France
Hani Seifeddine
Murex, Paris, France
Rima Kilany
Center For Computer Science, Saint Joseph University of Beirut, Beirut, Lebanon