🤖 AI Summary
Existing industrial recommendation systems rely heavily on ID-based features to model users' lifelong interests, resulting in poor generalization and weak semantic expressiveness. While multimodal signals have been incorporated into General Search Units (GSUs), their integration into Exact Search Units (ESUs) is often neglected. This paper proposes MUSE, the first framework to systematically leverage multimodal signals in both the GSU and ESU stages. The GSU employs lightweight cosine-similarity retrieval for coarse-grained candidate generation, whereas the ESU jointly models ID features and deep multimodal behavioral sequences, supporting ultra-long sequences (up to 100K items). MUSE integrates high-quality multimodal embeddings, efficient retrieval mechanisms, and ID-multimodal joint representation learning. Deployed in Taobao's display advertising system, it significantly improves core metrics, including CTR, with negligible latency overhead. Additionally, the authors release the first large-scale multimodal behavioral dataset to foster community research.
📝 Abstract
Lifelong user interest modeling is crucial for industrial recommender systems, yet existing approaches rely predominantly on ID-based features, suffering from poor generalization on long-tail items and limited semantic expressiveness. While recent work explores multimodal representations for behavior retrieval in the General Search Unit (GSU), it often neglects multimodal integration in the fine-grained modeling stage -- the Exact Search Unit (ESU). In this work, we present a systematic analysis of how to effectively leverage multimodal signals across both stages of the two-stage lifelong modeling framework. Our key insight is that simplicity suffices in the GSU: lightweight cosine similarity with high-quality multimodal embeddings outperforms complex retrieval mechanisms. In contrast, the ESU demands richer multimodal sequence modeling and effective ID-multimodal fusion to unlock its full potential. Guided by these principles, we propose MUSE, a simple yet effective multimodal search-based framework. MUSE has been deployed in the Taobao display advertising system, enabling 100K-length user behavior sequence modeling and delivering significant gains in top-line metrics with negligible online latency overhead. To foster community research, we share industrial deployment practices and open-source the first large-scale dataset featuring ultra-long behavior sequences paired with high-quality multimodal embeddings. Our code and data are available at https://taobao-mm.github.io.
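To make the "simplicity suffices in the GSU" claim concrete, here is a minimal sketch (not the paper's implementation) of lightweight cosine-similarity retrieval: given precomputed multimodal embeddings for the target item and for a user's long behavior sequence, the GSU keeps only the top-k most similar behaviors for the fine-grained ESU stage. All names and shapes below are illustrative assumptions.

```python
import numpy as np

def gsu_topk_retrieval(target_emb, behavior_embs, k):
    """Coarse-grained GSU sketch: rank a user's behavior sequence by
    cosine similarity to the target item's multimodal embedding and
    keep the top-k behaviors for fine-grained ESU modeling."""
    # L2-normalize so dot products equal cosine similarities.
    t = target_emb / np.linalg.norm(target_emb)
    b = behavior_embs / np.linalg.norm(behavior_embs, axis=1, keepdims=True)
    scores = b @ t
    # Indices of the k highest-scoring behaviors, best first.
    topk = np.argsort(-scores)[:k]
    return topk, scores[topk]

# Toy example: 6 behaviors with 4-dim embeddings; behavior 3 is a
# near-duplicate of the target, so it should rank first.
rng = np.random.default_rng(0)
behaviors = rng.normal(size=(6, 4))
target = behaviors[3] + 0.01 * rng.normal(size=4)
idx, sims = gsu_topk_retrieval(target, behaviors, k=2)
```

Because the scoring is a single normalized matrix-vector product, this step scales to ultra-long sequences and adds negligible latency, which is consistent with the abstract's claim that complex retrieval mechanisms are unnecessary at this stage.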