🤖 AI Summary
UX practitioners often struggle to retrieve semantically matched user flows, i.e., screen sequences that represent user tasks, which hinders exemplar-based communication with design and development teams. To address this, we propose the first vision-language cross-modal retrieval method tailored to user flows. Our approach uses contrastive learning to jointly embed screen images (with visual features extracted by a CNN or ViT) and natural-language task descriptions, yielding a relevance metric grounded in human perception. To our knowledge, this is the first application of contrastive learning to user flow representation, and it enables retrieval of semantically consistent screen sequences from natural language queries. A human-in-the-loop relevance evaluation shows that our method significantly outperforms baseline approaches at judging task-level semantic similarity, supporting both the effectiveness of visual embeddings for modeling user flows and their practical utility in real-world UX workflows.
📝 Abstract
Effective communication of UX considerations to stakeholders (e.g., designers and developers) is a critical challenge for UX practitioners. To explore this problem, we interviewed four UX practitioners about their communication challenges and strategies. Our study finds that providing an example user flow (a screen sequence representing a semantic task) as evidence strengthens communication, yet finding relevant examples remains challenging. To address this, we propose a method to systematically retrieve user flows using semantic embedding. Specifically, we design a model that learns to associate screens' visual features with user flow descriptions through contrastive learning. A survey confirms that our approach retrieves user flows better aligned with human perceptions of relevance. We analyze the results and discuss implications for the computational representation of user flows.
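The contrastive objective described above can be sketched as a symmetric InfoNCE loss over a batch of matched (user flow, description) pairs. The sketch below is illustrative, not the paper's exact setup: the mean-pooling of per-screen features into a flow embedding, the function names, and the temperature value are all assumptions.

```python
import numpy as np

def embed_flow(screen_feats):
    # Assumed aggregation: mean-pool per-screen CNN/ViT features
    # (shape: num_screens x dim) into one flow embedding.
    return screen_feats.mean(axis=0)

def info_nce(flow_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of N matched pairs.

    flow_emb, text_emb: (N, dim) arrays; row i of each is a matched pair.
    Matched pairs sit on the diagonal of the similarity matrix and are
    pulled together; off-diagonal (mismatched) pairs are pushed apart.
    """
    f = flow_emb / np.linalg.norm(flow_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (f @ t.T) / temperature  # cosine similarities, scaled

    def xent(l):
        # Softmax cross-entropy with the correct class on the diagonal.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(np.diag(p)).mean()

    # Average the flow-to-text and text-to-flow directions.
    return (xent(logits) + xent(logits.T)) / 2
```

In training, gradients of this loss would be backpropagated through both encoders so that a flow and its description land near each other in the shared space; at retrieval time, a text query is embedded once and flows are ranked by cosine similarity.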