🤖 AI Summary
Existing multimodal embedding models suffer from narrow modality coverage, unstable training, and poor domain adaptability in industrial settings. To address these issues, this paper proposes SAIL-Embedding, a unified omni-modal embedding foundation model. Methodologically, the authors design a multi-stage training framework that integrates content-aware progressive training with collaboration-aware recommendation enhancement; introduce stochastic specialization coupled with dataset-driven pattern matching to improve training flexibility and generalizability; and build on large vision-language models (VLMs) with dual-path distillation from sequence-to-item and ID-to-item embeddings to capture fine-grained user interests. The model markedly improves generalization and industrial robustness on cross-modal retrieval and recommendation tasks, and experiments demonstrate state-of-the-art performance across multiple benchmarks. In the Douyin-Selected scenario, it achieves +0.158% and +0.144% gains in 7-day and 14-day user Lifetime (LT), respectively, and match features from the model improve the feed ranking model's AUC by +0.08%.
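The dual-path distillation mentioned above is not spelled out in this summary; the following is a minimal PyTorch sketch of one plausible formulation, assuming a cosine-distance alignment between the multimodal item embedding (student) and frozen sequence-to-item / ID-to-item teacher embeddings. All names (`dual_path_distill_loss`, `item_emb`, `seq_teacher`, `id_teacher`) and the specific loss form are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def dual_path_distill_loss(
    item_emb: torch.Tensor,      # (B, D) multimodal item embeddings (student)
    seq_teacher: torch.Tensor,   # (B, D) sequence-to-item teacher embeddings (frozen)
    id_teacher: torch.Tensor,    # (B, D) ID-to-item teacher embeddings (frozen)
    w_seq: float = 1.0,          # weight of the sequence-to-item path
    w_id: float = 1.0,           # weight of the ID-to-item path
) -> torch.Tensor:
    """Align the student item embedding with both teachers via cosine distance (a sketch)."""
    item_emb = F.normalize(item_emb, dim=-1)
    seq_teacher = F.normalize(seq_teacher.detach(), dim=-1)
    id_teacher = F.normalize(id_teacher.detach(), dim=-1)

    # 1 - cosine similarity, averaged over the batch, for each distillation path
    loss_seq = (1.0 - (item_emb * seq_teacher).sum(dim=-1)).mean()
    loss_id = (1.0 - (item_emb * id_teacher).sum(dim=-1)).mean()
    return w_seq * loss_seq + w_id * loss_id
```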
📝 Abstract
Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. In the optimization procedure, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, content-aware progressive training aims to enhance the model's adaptability to diverse downstream tasks and to acquire richer cross-modal proficiency. Collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining users' historical interests. Concurrently, we develop stochastic specialization and dataset-driven pattern matching to strengthen training flexibility and generalizability. Experimental results show that SAIL-Embedding achieves SOTA performance compared with other methods on various retrieval tasks. In online experiments across real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), a crucial indicator of the recommendation experience. For instance, the model delivers a 7-day LT gain of +0.158% and a 14-day LT gain of +0.144% in the Douyin-Selected scenario. For the Douyin feed ranking model, the match features produced by SAIL-Embedding yield a +0.08% AUC gain.
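The abstract reports that "match features" derived from SAIL-Embedding improve the feed ranking model's AUC, but does not describe how such features are built. Below is a minimal sketch, assuming the common practice of feeding the cosine similarity between a pooled user-side embedding (e.g., mean of recently interacted items) and the candidate item's embedding into the ranker as a dense feature; the function and variable names are hypothetical and the pooling choice is an assumption.

```python
import torch
import torch.nn.functional as F

def match_feature(
    user_history_embs: torch.Tensor,  # (T, D) embeddings of recently interacted items
    candidate_emb: torch.Tensor,      # (D,) embedding of the candidate item
) -> torch.Tensor:
    """Cosine similarity between a pooled user-interest vector and the candidate item (a sketch)."""
    user_vec = F.normalize(user_history_embs.mean(dim=0), dim=-1)  # simple mean pooling (assumed)
    cand_vec = F.normalize(candidate_emb, dim=-1)
    return user_vec @ cand_vec  # scalar dense feature consumed by the ranking model

# Example usage: 20 history items and one candidate, both 256-dimensional
feat = match_feature(torch.randn(20, 256), torch.randn(256))
```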