A General Framework for Multimodal LLM-Based Multimedia Understanding in Large-Scale Recommendation Systems

📅 2026-05-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
Traditional recommender systems struggle to exploit the high-dimensional semantic information in multimedia content, which limits the accuracy of user preference modeling. This work proposes a general multimodal large language model (MM-LLM) framework for large-scale recommendation systems, achieving what is described as the first end-to-end integration within an industrial-grade, low-latency architecture. A LLaMA2-based model generates textual descriptions of images and videos; these captions are tokenized and ingested as categorical features whose embeddings feed the recommendation model. While keeping inference efficient, the method reports a 0.35% improvement in offline AUC and a 0.02% gain in key online metrics, supporting the practical value of MM-LLMs in real-world recommendation scenarios.
📝 Abstract
Conventional recommendation systems frequently fail to fully exploit the high-dimensional semantic signals inherent in multimedia content, thereby limiting the fidelity of user preference modeling. While Multimodal Large Language Models (MM-LLMs) offer robust mechanisms for interpreting such complex data, their integration into latency-constrained, industrial-scale architectures remains a significant challenge. To address this, we propose a generalized framework for MM-LLM-driven multimedia understanding. Our methodology employs a tripartite architecture encompassing content interpretation, representation extraction, and systematic pipeline integration, instantiated via a LLaMA2-based model that generates descriptive captions subsequently ingested as tokenized categorical features. Empirical evaluation demonstrates the efficacy of this approach, yielding a $0.35\%$ increase in offline AUC and a $0.02\%$ improvement in online metrics at scale, substantiating the practical viability of leveraging MM-LLMs to enhance large-scale recommendation performance.
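The step from generated caption to tokenized categorical feature can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: the caption is hard-coded rather than produced by an MM-LLM, tokens are mapped to IDs with the hashing trick, the embedding table is randomly initialized rather than learned, and the names `VOCAB_SIZE`, `EMBED_DIM`, `MAX_TOKENS`, and all function names are hypothetical.

```python
import hashlib
import random
import re

# Hypothetical sizes chosen for illustration; the paper does not state them.
VOCAB_SIZE = 1000   # number of hash buckets for caption tokens
EMBED_DIM = 4       # width of each embedding vector
MAX_TOKENS = 8      # captions are truncated to a fixed token budget


def tokenize(caption):
    """Lowercase the caption and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", caption.lower())


def token_to_id(token, vocab_size=VOCAB_SIZE):
    """Map a token to a stable categorical ID with the hashing trick."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % vocab_size


def caption_to_feature_ids(caption, max_tokens=MAX_TOKENS):
    """Turn a caption into a truncated list of categorical feature IDs."""
    return [token_to_id(t) for t in tokenize(caption)[:max_tokens]]


def make_embedding_table(vocab_size=VOCAB_SIZE, dim=EMBED_DIM, seed=0):
    """Stand-in for a learned embedding table (random values here)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size)]


def pooled_embedding(ids, table):
    """Mean-pool the embeddings of all IDs into one dense vector."""
    dim = len(table[0])
    if not ids:
        return [0.0] * dim
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(dim)]


# Assumed caption; in the described framework an MM-LLM would generate it.
caption = "A golden retriever catching a frisbee on the beach"
ids = caption_to_feature_ids(caption)
table = make_embedding_table()
vector = pooled_embedding(ids, table)
```

In this sketch `vector` is what would be concatenated with the recommendation model's other features. Because captions can be generated and tokenized ahead of serving time, only the ID lookup and pooling would sit on the request path, which is one plausible reading of how such a design stays within a latency budget.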
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Multimedia Understanding
Recommendation Systems
Semantic Signals
Latency-Constrained Integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
Recommendation Systems
Multimedia Understanding
Latency-Constrained Architecture
Feature Integration