SIREN: Unified Multi-Granularity Semantic Interaction for Multi-Modal Lifelong User Interest Modeling

📅 2026-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to lifelong user interest modeling typically model multi-modal features and collaborative signals separately and combine them only through late fusion, which leads to semantic misalignment and coarse-grained representations. To address this limitation, this work proposes SIREN, a unified multi-granularity semantic interaction framework. It combines target-aware coarse-to-fine retrieval, offering either multi-modal similarity-based soft retrieval or SemID-based hard retrieval, with a target-conditioned Transformer architecture, so that multi-modal and collaborative signals interact within one model. The method achieves a state-of-the-art GAUC in offline evaluation and consistent GMV gains in online A/B tests: +2.28% on Weixin Moments, +3.87% on Weixin Official Accounts, and +1.61% on Weixin Channels. SIREN has been fully deployed on Tencent's advertising platform since July 2025.
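The two retrieval options named in the summary can be illustrated with a toy example. The following is a minimal sketch, not the paper's implementation: the item layout (`emb`, `semid`), the use of cosine similarity, the `k` and `prefix_len` parameters, and both function names are assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def soft_retrieve(target_emb, history, k):
    """Soft retrieval (assumed form): keep the k history items whose
    multi-modal embedding is most similar to the target's embedding."""
    scored = [(cosine(target_emb, item["emb"]), item["id"]) for item in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]

def hard_retrieve(target_semid, history, prefix_len):
    """Hard retrieval (assumed form): keep history items whose Semantic ID
    shares its first prefix_len codes with the target's Semantic ID."""
    prefix = tuple(target_semid[:prefix_len])
    return [item["id"] for item in history
            if tuple(item["semid"][:prefix_len]) == prefix]

# Hypothetical behavior history: 2-d embeddings and 3-level SemIDs.
history = [
    {"id": "a", "emb": [1.0, 0.0], "semid": (3, 7, 1)},
    {"id": "b", "emb": [0.9, 0.1], "semid": (3, 7, 4)},
    {"id": "c", "emb": [0.0, 1.0], "semid": (5, 2, 2)},
    {"id": "d", "emb": [0.7, 0.7], "semid": (3, 1, 9)},
]
target = {"emb": [1.0, 0.0], "semid": (3, 7, 0)}

soft_ids = soft_retrieve(target["emb"], history, k=2)
hard_ids_fine = hard_retrieve(target["semid"], history, prefix_len=2)
hard_ids_coarse = hard_retrieve(target["semid"], history, prefix_len=1)
print(soft_ids, hard_ids_fine, hard_ids_coarse)
```

In this sketch the soft variant scores every history item, while the hard variant reduces to an exact match on discrete codes, which could be served from an index keyed by SemID prefix. That difference is one plausible reading of why hard retrieval is described as the more serving-friendly option.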
📝 Abstract
Industrial recommender systems increasingly leverage lifelong user behavior histories and rich multi-modal content to capture evolving user preferences. However, effectively integrating multi-modal features into lifelong interest modeling remains challenging due to the inherent misalignment between multi-modal and collaborative spaces. Existing paradigms typically model the multi-modal sequence and the behavior sequence separately and rely on late fusion to alleviate the modality gap, which results in coarse-grained multi-modal representations and limited integration. In this paper, we propose SIREN, a unified multi-granularity semantic interaction framework for multi-modal lifelong user interest modeling. In the General Search Unit stage, we introduce two alternative retrieval strategies: multi-modal similarity-based soft retrieval for retrieval effectiveness, and Semantic ID (SemID)-based hard retrieval for efficient industrial serving. For the Exact Search Unit stage, we explicitly incorporate target-aware relevance via coarse similarity buckets and fine-grained prefix-encoded SemIDs, enabling unified interaction with collaborative ID features within the target-conditioned transformer architecture. Extensive experiments on the offline dataset demonstrate that SIREN achieves a state-of-the-art GAUC. Online A/B tests further demonstrate consistent GMV gains across multiple production scenarios, including +2.28% in Weixin Moments, +3.87% in Weixin Official Accounts, and +1.61% in Weixin Channels. Since July 2025, SIREN has been fully launched for full-traffic serving on Tencent's advertising platform.
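The abstract does not specify how the coarse similarity buckets or the prefix-encoded SemIDs are built. The sketch below shows one plausible construction under stated assumptions: uniform bucketing of a cosine similarity in [-1, 1], and one token per SemID prefix joined with "-". The function names, the bucket count, and the token format are all hypothetical.

```python
def similarity_bucket(sim, num_buckets=8):
    """Coarse feature (assumed form): map a similarity in [-1, 1] to an
    integer bucket in [0, num_buckets - 1] using uniform-width bins."""
    clipped = max(-1.0, min(1.0, sim))
    idx = int((clipped + 1.0) / 2.0 * num_buckets)
    return min(idx, num_buckets - 1)  # sim == 1.0 falls into the last bucket

def prefix_encode(semid):
    """Fine-grained feature (assumed form): emit one token per prefix of the
    SemID, so each token identifies the full path down to that level."""
    return ["-".join(str(code) for code in semid[:level])
            for level in range(1, len(semid) + 1)]

def shared_prefix_len(semid_a, semid_b):
    """Number of leading SemID codes two items have in common."""
    count = 0
    for code_a, code_b in zip(semid_a, semid_b):
        if code_a != code_b:
            break
        count += 1
    return count

# Hypothetical target item and one retrieved behavior.
target_semid = (3, 7, 0)
behavior_semid = (3, 7, 4)

bucket = similarity_bucket(0.0)            # mid-range similarity
tokens = prefix_encode(behavior_semid)     # tokens for an ID embedding lookup
overlap = shared_prefix_len(target_semid, behavior_semid)
print(bucket, tokens, overlap)
```

Both outputs are discrete IDs, so in a setup like this they could be embedded with ordinary lookup tables and fed to the transformer alongside the collaborative ID features, which is one way to read the "unified interaction" described above.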
Problem

Research questions and friction points this paper is trying to address.

multi-modal
lifelong user interest modeling
semantic interaction
modality gap
user preference
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-granularity semantic interaction
Semantic ID (SemID)
target-conditioned transformer
multi-modal lifelong interest modeling
unified representation learning