🤖 AI Summary
Existing video recommendation systems rely on low-level visual/acoustic features or manually curated metadata, which limits their capacity to model high-level semantics such as user intent, humor, and commonsense knowledge, and thus constrains personalization. To address this, we propose a fine-tuning-free, universal framework that leverages off-the-shelf multimodal large language models (MLLMs) to generate rich natural-language video descriptions. These descriptions are embedded with advanced text encoders and combined with collaborative filtering, content-based, and generative recommendation models, bridging the semantic gap between low-level features and users' latent preferences. On the MicroLens-100K dataset, incorporating MLLM-generated descriptions consistently improves performance across five state-of-the-art recommendation models, demonstrating both the effectiveness and the broad applicability of injecting high-level semantic representations into video recommendation.
📝 Abstract
Existing video recommender systems rely primarily on user-defined metadata or on low-level visual and acoustic signals extracted by specialised encoders. These low-level features describe what appears on the screen but miss deeper semantics such as intent, humour, and world knowledge that make clips resonate with viewers. For example, is a 30-second clip simply a singer on a rooftop, or an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such distinctions are critical to personalised recommendation yet remain invisible to traditional encoding pipelines. In this paper, we introduce a simple, recommender-agnostic, fine-tuning-free framework that injects high-level semantics into the recommendation pipeline by prompting an off-the-shelf Multimodal Large Language Model (MLLM) to summarise each clip into a rich natural-language description (e.g. "a superhero parody with slapstick fights and orchestral stabs"), bridging the gap between raw content and user intent. We encode the MLLM output with a state-of-the-art text encoder and feed the resulting representations into standard collaborative, content-based, and generative recommenders. On the MicroLens-100K dataset, which emulates user interactions with TikTok-style videos, our framework consistently surpasses conventional video, audio, and metadata features across five representative models. Our findings highlight the promise of leveraging MLLMs as on-the-fly knowledge extractors for building more intent-aware video recommenders.
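To make the pipeline concrete, here is a minimal illustrative sketch of the three stages described above: prompting an off-the-shelf MLLM for a clip description, encoding that description with a text encoder, and scoring items in a downstream recommender. The prompt, the stubbed `describe_clip_with_mllm` helper, the choice of `sentence-transformers` encoder, and the stand-in user vector are assumptions for illustration only, not the paper's exact implementation.

```python
# Minimal sketch of the zero-fine-tuning pipeline (all names and the prompt
# are illustrative assumptions, not the authors' implementation).
# Requires: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer


def describe_clip_with_mllm(video_path: str) -> str:
    """Placeholder for prompting an off-the-shelf MLLM with a video clip.

    In practice, sampled frames (and optionally the audio track) would be sent
    to a multimodal LLM with a prompt such as:
    'Describe this short video, including its intent, humour, and context.'
    """
    return ("A superhero parody with slapstick fights and orchestral stabs, "
            "filmed amid the fairy chimneys of Cappadocia, Turkey.")


# 1) Generate a rich natural-language description per clip (no fine-tuning).
video_ids = ["clip_001", "clip_002"]
descriptions = [describe_clip_with_mllm(v) for v in video_ids]

# 2) Encode descriptions with a strong text encoder to obtain item features.
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any strong encoder
item_features = text_encoder.encode(descriptions)        # (num_items, dim)

# 3) Feed the features into a downstream recommender; here, a stand-in user
#    embedding (in practice learned by a collaborative/content-based model)
#    is scored against the MLLM-derived item embeddings by dot product.
rng = np.random.default_rng(0)
user_vector = rng.normal(size=item_features.shape[1])
scores = item_features @ user_vector
ranking = [video_ids[i] for i in np.argsort(-scores)]
print(ranking)
```

In the full framework, step 3 would be replaced by whichever recommender is in use (collaborative, content-based, or generative); the MLLM-derived text embeddings simply take the place of, or are fused with, the conventional video, audio, and metadata features.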