Multimodal Fusion And Sparse Attention-based Alignment Model for Long Sequential Recommendation

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To bridge the gap between multimodal content understanding and multi-granularity user interest modeling in long-sequence recommendation, this paper proposes MUFASA. Methodologically, it introduces (1) a cross-genre semantic anchor-guided Multimodal Fusion Layer (MFL), which uses item-title semantics together with four tailored loss functions to enhance cross-modal representation consistency; and (2) a Sparse Attention Alignment Layer (SAL) that integrates windowed, block-level, and selective attention mechanisms to model hierarchical interest evolution in long-term user behavior sequences, capturing both coarse-grained block-level trends and fine-grained behavioral shifts. Extensive experiments on multiple real-world benchmarks demonstrate significant improvements over state-of-the-art methods, and online A/B tests show substantial CTR gains, validating MUFASA's effectiveness in deep multimodal signal fusion and precise characterization of diverse user preferences.

📝 Abstract
Recent advances in multimodal recommendation enable richer item understanding, while modeling users' multi-scale interests across temporal horizons has attracted growing attention. However, effectively exploiting multimodal item sequences and mining multi-grained user interests to substantially bridge the gap between content comprehension and recommendation remain challenging. To address these issues, we propose MUFASA, a MUltimodal Fusion And Sparse Attention-based Alignment model for long sequential recommendation. Our model comprises two core components. First, the Multimodal Fusion Layer (MFL) leverages item titles as a cross-genre semantic anchor and is trained with a joint objective of four tailored losses that promote: (i) cross-genre semantic alignment, (ii) alignment to the collaborative space for recommendation, (iii) preserving the similarity structure defined by titles and preventing modality representation collapse, and (iv) distributional regularization of the fusion space. This yields high-quality fused item representations for further preference alignment. Second, the Sparse Attention-guided Alignment Layer (SAL) scales to long user-behavior sequences via a multi-granularity sparse attention mechanism, which incorporates windowed attention, block-level attention, and selective attention, to capture user interests hierarchically and across temporal horizons. SAL explicitly models both the evolution of coherent interest blocks and fine-grained intra-block variations, producing robust user and item representations. Extensive experiments on real-world benchmarks show that MUFASA consistently surpasses state-of-the-art baselines. Moreover, online A/B tests demonstrate significant gains in production, confirming MUFASA's effectiveness in leveraging multimodal cues and accurately capturing diverse user preferences.
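As a rough illustration of how the four tailored losses (i) through (iv) above could be combined, here is a minimal NumPy sketch. Every function name, loss form, and weight below is an illustrative assumption, not the paper's actual formulation: the contrastive and regularization terms are common stand-ins for the alignment and anti-collapse objectives described in the abstract.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce(a, b, tau=0.1):
    """In-batch contrastive loss: row i of `a` should match row i of `b`."""
    logits = l2norm(a) @ l2norm(b).T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def four_term_fusion_loss(img, txt, title, cf, fused, w=(1.0, 1.0, 0.5, 0.1)):
    """Hypothetical four-term fusion objective (weights and exact loss
    forms are assumptions for illustration, not the paper's design).

    (i)   cross-genre alignment: pull image and text views of an item
          toward the shared title anchor;
    (ii)  collaborative alignment: align the fused representation with
          the item's collaborative (ID) embedding;
    (iii) similarity-structure preservation: the pairwise similarity
          matrix of fused items should mirror that of their titles;
    (iv)  distributional regularization: keep per-dimension variance of
          the fused space away from collapse (hinge on std below 1).
    """
    l_align = info_nce(img, title) + info_nce(txt, title)        # (i)
    l_cf = info_nce(fused, cf)                                   # (ii)
    sim_f = l2norm(fused) @ l2norm(fused).T
    sim_t = l2norm(title) @ l2norm(title).T
    l_struct = np.mean((sim_f - sim_t) ** 2)                     # (iii)
    l_reg = np.mean(np.maximum(0.0, 1.0 - fused.std(axis=0)))    # (iv)
    return w[0] * l_align + w[1] * l_cf + w[2] * l_struct + w[3] * l_reg
```

Each term is non-negative, so the weights control only the trade-off between alignment, structure preservation, and anti-collapse regularization.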
Problem

Research questions and friction points this paper is trying to address.

Bridging content comprehension and recommendation gaps effectively
Exploiting multimodal item sequences for better recommendations
Mining multi-grained user interests across temporal horizons
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Fusion Layer with four tailored losses
Sparse Attention-guided Alignment for long sequences
Hierarchical multi-granularity interest modeling
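The hierarchical multi-granularity sparse attention noted above (windowed, block-level, and selective attention) can be sketched as a combined boolean attention mask. The function below is a hypothetical illustration: the parameter names are invented, and the random picks stand in for whatever learned relevance-based selection the paper actually uses.

```python
import numpy as np

def sparse_attention_mask(seq_len, window=4, block=8, n_select=2, seed=0):
    """Build a boolean mask combining three sparsity patterns.

    mask[i, j] == True means query position i may attend to key j.
    All parameter names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # 1) Windowed attention: each position attends to its local
    #    neighborhood, capturing fine-grained short-range shifts.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # 2) Block-level attention: contiguous blocks attend within
    #    themselves, modeling coarse-grained interest blocks.
    for start in range(0, seq_len, block):
        end = min(seq_len, start + block)
        mask[start:end, start:end] = True

    # 3) Selective attention: each query also attends to a few extra
    #    positions (random here, as a stand-in for learned selection).
    for i in range(seq_len):
        picks = rng.choice(seq_len, size=n_select, replace=False)
        mask[i, picks] = True

    return mask
```

Applying such a mask inside scaled dot-product attention keeps the per-query cost roughly linear in `window + block + n_select` rather than quadratic in sequence length, which is what makes long behavior sequences tractable.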
Yongrui Fu
Fudan University, Shanghai, China
Jian Liu
Baidu, Inc., Beijing, China
Tao Li
Baidu, Inc., Beijing, China
Zonggang Wu
Baidu, Inc., Beijing, China
Shouke Qin
Baidu, Inc., Beijing, China
Hanmeng Liu
Associate Professor | Hainan University
Natural language processing