MMSRARec: Summarization and Retrieval Augmented Sequential Recommendation Based on Multimodal Large Language Model

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key bottlenecks in applying multimodal large language models (MLLMs) to sequential recommendation—weak interpretability of item representations, high inference overhead (requiring multiple LLM calls), and absence of explicit collaborative signals—this paper proposes a summarization-and-retrieval-augmented framework. Methodologically, it integrates MLLMs with reward-driven summarization optimization, keyword-based semantic encoding, and retrieval-augmented generation (RAG). Its core contributions are: (1) an adaptive keyword summarization strategy under multi-task supervised fine-tuning, compressing user behavior into interpretable, semantically grounded keywords; and (2) the first explicit encoding of collaborative signals as keywords, incorporated as contextual cues into the RAG pipeline. Evaluated on mainstream benchmarks, the approach achieves significant accuracy gains while enabling single-LLM-call inference, keyword-level behavioral interpretation, and end-to-end collaborative awareness.

📝 Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant potential in recommendation systems. However, the effective application of MLLMs to multimodal sequential recommendation remains unexplored: A) Existing methods primarily leverage the multimodal semantic understanding capabilities of pre-trained MLLMs to generate item embeddings or semantic IDs, thereby enhancing traditional recommendation models. These approaches generate item representations with limited interpretability and pose challenges when transferring to language model-based recommendation systems. B) Other approaches convert user behavior sequences into image-text pairs and perform recommendation through multiple MLLM inference calls, incurring prohibitive computational and time costs. C) Current MLLM-based recommendation systems generally neglect the integration of collaborative signals. To address these limitations while balancing recommendation performance, interpretability, and computational cost, this paper proposes MultiModal Summarization-and-Retrieval-Augmented Sequential Recommendation (MMSRARec). Specifically, we first employ an MLLM to summarize items into concise keywords and fine-tune the model using rewards that incorporate summary length, information loss, and reconstruction difficulty, thereby enabling adaptive adjustment of the summarization policy. Inspired by retrieval-augmented generation, we then transform collaborative signals into corresponding keywords and integrate them as supplementary context. Finally, we apply supervised fine-tuning with multi-task learning to align the MLLM with the multimodal sequential recommendation task. Extensive evaluations on common recommendation datasets demonstrate the effectiveness of MMSRARec, showcasing its capability to efficiently and interpretably understand user behavior histories and item information for accurate recommendations.
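The abstract names three terms in the summarization reward: summary length, information loss, and reconstruction difficulty. A minimal sketch of how such a reward might be combined is shown below; the function name, weights, and normalization are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a reward combining the three terms named in the
# abstract. Weights and ranges are assumptions for illustration only.

def summarization_reward(summary_len: int,
                         info_loss: float,
                         recon_difficulty: float,
                         max_len: int = 20,
                         w_len: float = 0.3,
                         w_loss: float = 0.5,
                         w_recon: float = 0.2) -> float:
    """Shorter summaries, lower information loss, and lower
    reconstruction difficulty all increase the reward."""
    length_term = 1.0 - min(summary_len, max_len) / max_len  # shorter is better
    loss_term = 1.0 - info_loss          # assume info_loss in [0, 1]
    recon_term = 1.0 - recon_difficulty  # assume difficulty in [0, 1]
    return w_len * length_term + w_loss * loss_term + w_recon * recon_term

# A concise, faithful, easily reconstructed summary scores high:
print(summarization_reward(summary_len=5, info_loss=0.1, recon_difficulty=0.2))
```

In a policy-optimization setup, such a scalar reward would steer the MLLM toward keyword summaries that are short but still reconstructable, matching the "adaptive adjustment of the summarization policy" the abstract describes.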
Problem

Research questions and friction points this paper is trying to address.

Improves interpretability of multimodal sequential recommendation systems
Reduces computational costs of MLLM-based recommendation inference
Integrates collaborative signals into multimodal recommendation frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses MLLM to summarize items into adaptive keywords
Integrates collaborative signals as retrieval-augmented context
Applies multi-task fine-tuning for multimodal sequential recommendation
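The innovation points above describe encoding collaborative signals as keywords and supplying them as retrieval-augmented context, so recommendation needs only a single LLM call. A minimal sketch of assembling such a prompt is given below; the template, field names, and example data are assumptions for illustration.

```python
# Hypothetical sketch of a single-call recommendation prompt that mixes
# keyword summaries of the user's history with keyword-encoded
# collaborative signals. The template is an illustrative assumption.

def build_prompt(history_keywords, collaborative_keywords, candidates):
    """Assemble one prompt so ranking requires a single LLM call."""
    history = "; ".join(", ".join(kws) for kws in history_keywords)
    collab = ", ".join(collaborative_keywords)
    cands = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates, 1))
    return (
        f"User history (keyword summaries): {history}\n"
        f"Collaborative context (keywords): {collab}\n"
        f"Candidates:\n{cands}\n"
        "Rank the candidates for this user."
    )

prompt = build_prompt(
    history_keywords=[["wireless", "earbuds"], ["running", "shoes"]],
    collaborative_keywords=["fitness", "audio"],
    candidates=["smartwatch", "yoga mat"],
)
print(prompt)
```

Because both the behavior history and the collaborative signals are plain keywords, the resulting context stays human-readable, which is the interpretability benefit the paper claims.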
Haoyu Wang
College of Computer Science and Artificial Intelligence, Fudan University
Yitong Wang
ByteDance Inc.
Jining Wang
College of Computer Science and Artificial Intelligence, Fudan University