DashCLIP: Leveraging multimodal models for generating semantic embeddings for DoorDash

πŸ“… 2025-03-18
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 2
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing multimodal models struggle to capture fine-grained semantic relationships between merchant items and user queries on the DoorDash platform. Method: We propose a joint multimodal embedding learning framework that requires no user behavioral history. It employs an image–text contrastive learning objective to align pretrained unimodal encoders (CLIP, BERT) with a customized multimodal encoder. To reduce reliance on proprietary business signals, we introduce the first large-scale relevance annotation dataset synthesized via large language models (LLMs). Furthermore, we enable end-to-end joint optimization of both unimodal and multimodal encoders. Results: Experiments demonstrate substantial improvements in item classification and relevance prediction. In advertising recommendation, the method achieves 12.3% lift in click-through rate (CTR) and 9.7% lift in conversion rate (CVR), validating its cross-task generalization capability and direct business impact.

Technology Category

Application Category

πŸ“ Abstract
Despite the success of vision-language models in various generative tasks, obtaining high-quality semantic representations for products and user intents is still challenging due to the inability of off-the-shelf models to capture nuanced relationships between the entities. In this paper, we introduce a joint training framework for product and user queries by aligning uni-modal and multi-modal encoders through contrastive learning on image-text data. Our novel approach trains a query encoder with an LLM-curated relevance dataset, eliminating the reliance on engagement history. These embeddings demonstrate strong generalization capabilities and improve performance across applications, including product categorization and relevance prediction. For personalized ads recommendation, a significant uplift in the click-through rate and conversion rate after the deployment further confirms the impact on key business metrics. We believe that the flexibility of our framework makes it a promising solution toward enriching the user experience across the e-commerce landscape.
Problem

Research questions and friction points this paper is trying to address.

Generating high-quality semantic embeddings for products and user intents
Aligning uni-modal and multi-modal encoders through contrastive learning
Improving e-commerce applications like categorization and recommendation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint training framework aligning multimodal encoders
Query encoder trained with LLM-curated relevance dataset
Contrastive learning on image-text data for embeddings
Omkar Gurjar
Omkar Gurjar
University of Illinois, Urbana-Champaign
Social Media AnalysisSocial ComputingNatural Language Processing
Kin Sum Liu
Kin Sum Liu
Twitter
P
Praveen Kolli
DoorDash, Inc.
U
Utsaw Kumar
DoorDash, Inc.
M
Mandar Rahurkar
DoorDash, Inc.