SIREN: Unified Multi-Granularity Semantic Interaction for Multi-Modal Lifelong User Interest Modeling

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches to lifelong user interest modeling typically model multi-modal features and collaborative signals separately and combine them through late fusion, which leads to semantic misalignment and coarse-grained representations. To address this limitation, this work proposes SIREN, a unified multi-granularity semantic interaction framework. In the retrieval stage it offers two alternative strategies: multi-modal similarity-based soft retrieval and SemID-based hard retrieval. In the subsequent stage, a target-conditioned Transformer lets target-aware relevance signals (coarse similarity buckets and fine-grained prefix-encoded SemIDs) interact directly with collaborative ID features. SIREN achieves state-of-the-art GAUC in offline evaluation and consistent online gains in A/B tests, raising GMV by 2.28%, 3.87%, and 1.61% on Weixin Moments, Weixin Official Accounts, and Weixin Channels, respectively. It has been serving full traffic on Tencent's advertising platform since July 2025.

📝 Abstract

Industrial recommender systems increasingly leverage lifelong user behavior histories and rich multi-modal content to capture evolving user preferences. However, effectively integrating multi-modal features into lifelong interest modeling remains challenging due to the inherent misalignment between the multi-modal and collaborative spaces. Existing paradigms typically model the multi-modal sequence and the behavior sequence separately and rely on late fusion to alleviate the modality gap, which results in coarse-grained multi-modal representations and limited integration. In this paper, we propose SIREN, a unified multi-granularity semantic interaction framework for multi-modal lifelong user interest modeling. In the General Search Unit stage, we introduce two alternative retrieval strategies: multi-modal similarity-based soft retrieval for retrieval effectiveness, and Semantic ID (SemID)-based hard retrieval for efficient industrial serving. In the Exact Search Unit stage, we explicitly incorporate target-aware relevance via coarse similarity buckets and fine-grained prefix-encoded SemIDs, enabling unified interaction with collaborative ID features within a target-conditioned transformer architecture. Extensive experiments on the offline dataset demonstrate that SIREN achieves state-of-the-art GAUC. Online A/B tests further demonstrate consistent GMV gains across multiple production scenarios, including +2.28% in Weixin Moments, +3.87% in Weixin Official Accounts, and +1.61% in Weixin Channels. Since July 2025, SIREN has been fully deployed for full-traffic serving on Tencent's advertising platform.
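The two retrieval strategies and the two target-aware relevance features named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the similarity measure, the matching rule, or the bucketing scheme, so the sketch assumes cosine similarity with top-k selection for soft retrieval, a shared-prefix threshold for hard retrieval, uniform buckets over [-1, 1], and one token per SemID prefix. All function and parameter names (`soft_retrieve`, `hard_retrieve`, `min_prefix`, `num_buckets`, and so on) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
SemID = Tuple[int, ...]  # assumed: a SemID is a tuple of hierarchical codes


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def soft_retrieve(target: Vector, history: Sequence[Vector], k: int) -> List[int]:
    """Soft retrieval (assumed form): indices of the k most similar history items."""
    order = sorted(range(len(history)),
                   key=lambda i: cosine(target, history[i]),
                   reverse=True)
    return order[:k]


def shared_prefix_len(a: SemID, b: SemID) -> int:
    """Number of leading codes two SemIDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def hard_retrieve(target: SemID, history: Sequence[SemID],
                  min_prefix: int, k: int) -> List[int]:
    """Hard retrieval (assumed form): history items sharing at least
    `min_prefix` leading codes with the target. History is assumed to be
    in chronological order, and the k most recent matches are kept."""
    hits = [i for i, s in enumerate(history)
            if shared_prefix_len(target, s) >= min_prefix]
    return hits[-k:] if k > 0 else []


def similarity_bucket(sim: float, num_buckets: int) -> int:
    """Coarse feature: map a similarity in [-1, 1] to a uniform bucket id."""
    scaled = (sim + 1.0) / 2.0
    return max(0, min(int(scaled * num_buckets), num_buckets - 1))


def prefix_tokens(semid: SemID) -> List[str]:
    """Fine-grained feature: one token per SemID prefix,
    e.g. (3, 1, 4) -> ['3', '3-1', '3-1-4']."""
    return ["-".join(map(str, semid[: i + 1])) for i in range(len(semid))]


# Toy usage with made-up data.
target_emb = [1.0, 0.0]
history_embs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [-1.0, 0.0]]
target_semid = (3, 1, 5)
history_semids = [(3, 1, 4), (2, 7, 1), (3, 1, 9), (3, 5, 2)]

soft_hits = soft_retrieve(target_emb, history_embs, k=2)
hard_hits = hard_retrieve(target_semid, history_semids, min_prefix=2, k=2)
```

In this sketch the hard path needs only integer comparisons on precomputed codes, whereas the soft path scores every history item against the target embedding, which is one plausible reading of why the abstract associates hard retrieval with efficient serving. The bucket ids and prefix tokens would then be embedded alongside collaborative ID features as inputs to the target-conditioned transformer; that model is not shown here.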

Problem

Research questions and friction points this paper is trying to address.

multi-modal

lifelong user interest modeling

semantic interaction

modality gap

user preference

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-granularity semantic interaction

Semantic ID (SemID)

target-conditioned transformer