🤖 AI Summary
Problem: Semantic IDs quantized from multimodal large language model (MLLM) embeddings carry no collaborative filtering (CF) signals and are therefore misaligned with downstream recommendation objectives, while conventional two-stage alignment incurs information loss and inflexible optimization. Method: We propose a single-stage dual-alignment semantic ID framework that jointly optimizes discrete quantization and alignment with collaborative signals. It introduces a multi-view contrastive alignment mechanism and a dual-learning strategy to enable adaptive, flexible alignment of user- and ad-side semantic IDs. It combines MLLM embeddings, an ID-based CF debias module, and three contrastive views (u2i, i2i/u2u, and co-occurrence i2i/u2u) in a unified contrastive learning objective. Contribution/Results: The method is deployed across multiple advertising scenarios at Kuaishou, serving over 400 million users daily, and its effectiveness is evaluated through offline experiments and online A/B tests.
📝 Abstract
Semantic IDs are discrete identifiers obtained by quantizing embeddings from Multi-modal Large Language Models (MLLMs), enabling efficient integration of multi-modal content into recommendation systems. However, because they carry no collaborative signals, they are misaligned with downstream discriminative and generative recommendation objectives. Recent studies have introduced various alignment mechanisms to address this problem, but their two-stage design has two main limitations: (1) inevitable information loss during alignment, and (2) inflexibility in applying adaptive alignment strategies, both of which constrain mutual information maximization during alignment. To address these limitations, we propose Dual-Aligned Semantic IDs (DAS), a novel and flexible one-stage method that optimizes quantization and alignment simultaneously, preserving semantic integrity and alignment quality while avoiding the information loss typical of two-stage methods. DAS aligns semantic IDs with collaborative signals more efficiently through two approaches: (1) Multi-view Contrastive Alignment: to maximize mutual information between semantic IDs and collaborative signals, we first incorporate an ID-based CF debias module and then design three contrastive alignment methods: dual user-to-item (u2i), dual item-to-item/user-to-user (i2i/u2u), and dual co-occurrence item-to-item/user-to-user (i2i/u2u). (2) Dual Learning: aligning the dual quantizations of users and ads yields stronger alignment between the semantic IDs constructed for users and for ads. Finally, we conduct extensive offline experiments and online A/B tests to evaluate the effectiveness of DAS, which is now deployed across various advertising scenarios in the Kuaishou App, serving over 400 million users daily.
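To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch, not the DAS implementation. It assumes residual quantization as the way semantic IDs are produced and a symmetric InfoNCE loss with in-batch negatives as the "dual u2i" contrastive view; the abstract specifies neither choice in detail. All function names, codebook sizes, and the temperature are hypothetical, the codebooks are random rather than learned, and only the forward computation is shown (no gradients, no CF debias module, no i2i/u2u views).

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Quantize each row of x (n, d) with a stack of codebooks, each (k, d).

    Returns the semantic IDs (n, levels) and the quantized vectors (n, d).
    Assumption: residual quantization stands in for the paper's quantizer.
    """
    residual = np.asarray(x, dtype=float).copy()
    quantized = np.zeros_like(residual)
    codes = []
    for cb in codebooks:
        # Squared distance from every residual to every code: shape (n, k).
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = d2.argmin(axis=1)
        codes.append(idx)
        quantized += cb[idx]   # accumulate the chosen code at this level
        residual -= cb[idx]    # next level quantizes what is left over
    return np.stack(codes, axis=1), quantized

def info_nce(a, b, temperature=0.1):
    """InfoNCE with in-batch negatives: row i of a is paired with row i of b."""
    eps = 1e-12
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def dual_u2i_loss(user_q, item_q, temperature=0.1):
    """Symmetric (user-to-item and item-to-user) contrastive loss."""
    return 0.5 * (info_nce(user_q, item_q, temperature)
                  + info_nce(item_q, user_q, temperature))

# Toy batch: 8 (user, clicked ad) pairs with 16-dimensional embeddings.
rng = np.random.default_rng(0)
user_emb = rng.normal(size=(8, 16))
item_emb = rng.normal(size=(8, 16))
user_codebooks = [rng.normal(size=(32, 16)) for _ in range(3)]
item_codebooks = [rng.normal(size=(32, 16)) for _ in range(3)]

user_ids, user_q = residual_quantize(user_emb, user_codebooks)
item_ids, item_q = residual_quantize(item_emb, item_codebooks)
loss = dual_u2i_loss(user_q, item_q)
print(user_ids.shape, item_ids.shape)
print(loss)
```

In a one-stage setup of the kind the abstract describes, a loss like `dual_u2i_loss` would be minimized together with the quantization objective so that the codebooks themselves are shaped by collaborative signals. A two-stage pipeline would instead freeze one component before fitting the other.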