🤖 AI Summary
Traditional recommender systems struggle to exploit the high-dimensional semantic information in multimedia content, which limits the accuracy of user preference modeling. This work proposes a general multimodal large language model (MM-LLM) framework for large-scale recommendation and integrates it into a latency-constrained, industrial-scale pipeline. A LLaMA2-based model generates textual descriptions of multimedia content such as images and videos; these captions are tokenized and fed to the recommendation model as categorical features. The method yields a 0.35% increase in offline AUC and a 0.02% improvement in online metrics at scale, supporting the practical value of MM-LLMs in real-world recommendation scenarios.
📝 Abstract
Conventional recommendation systems frequently fail to fully exploit the high-dimensional semantic signals inherent in multimedia content, thereby limiting the fidelity of user preference modeling. While Multimodal Large Language Models (MM-LLMs) offer robust mechanisms for interpreting such complex data, their integration into latency-constrained, industrial-scale architectures remains a significant challenge. To address this, we propose a generalized framework for MM-LLM-driven multimedia understanding. Our methodology employs a tripartite architecture encompassing content interpretation, representation extraction, and systematic pipeline integration, instantiated via a LLaMA2-based model that generates descriptive captions subsequently ingested as tokenized categorical features. Empirical evaluation demonstrates the efficacy of this approach, yielding a $0.35\%$ increase in offline AUC and a $0.02\%$ improvement in online metrics at scale, substantiating the practical viability of leveraging MM-LLMs to enhance large-scale recommendation performance.
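The caption-to-feature step described above can be illustrated with a minimal sketch. It assumes the MM-LLM caption has already been produced (for example, offline per item) and only shows how a caption string might become a fixed-length categorical feature with an embedding lookup. The hashing scheme, bucket count, feature length, embedding size, and mean pooling are all hypothetical choices for illustration; the abstract does not specify them.

```python
import hashlib
import random
import re

# All constants below are illustrative assumptions, not values from the paper.
VOCAB_BUCKETS = 1024  # size of the hashed token-ID space
MAX_TOKENS = 8        # fixed length of the categorical feature
EMBED_DIM = 4         # embedding width
PAD_ID = 0            # ID reserved for padding


def tokenize(caption: str) -> list[str]:
    """Lowercase the caption and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", caption.lower())


def token_to_id(token: str, buckets: int = VOCAB_BUCKETS) -> int:
    """Map a token to a stable ID in [1, buckets - 1]; 0 is kept for padding."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return 1 + int.from_bytes(digest[:8], "big") % (buckets - 1)


def caption_to_feature(caption: str, max_tokens: int = MAX_TOKENS) -> list[int]:
    """Turn a caption into a fixed-length list of categorical IDs."""
    ids = [token_to_id(t) for t in tokenize(caption)][:max_tokens]
    return ids + [PAD_ID] * (max_tokens - len(ids))


def make_embedding_table(buckets: int = VOCAB_BUCKETS,
                         dim: int = EMBED_DIM,
                         seed: int = 0) -> list[list[float]]:
    """Randomly initialised table; in a real model these rows would be trained."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(buckets)]


def mean_pool(ids: list[int], table: list[list[float]]) -> list[float]:
    """Average the embeddings of the non-padding IDs into one item vector."""
    rows = [table[i] for i in ids if i != PAD_ID]
    if not rows:
        return [0.0] * len(table[0])
    return [sum(col) / len(rows) for col in zip(*rows)]
```

Calling `caption_to_feature` on a caption string and passing the result to `mean_pool` gives one dense vector per item, which a ranking model could consume alongside its other categorical features. Hashing avoids maintaining an explicit vocabulary at the cost of occasional ID collisions; a learned tokenizer vocabulary would be an equally plausible choice.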