🤖 AI Summary
Existing video recommendation systems rely on low-level visual/acoustic features or manually curated metadata, which limits their capacity to model high-level semantics such as user intent, humor, and commonsense knowledge, and thus constrains personalization. To address this, we propose a fine-tuning-free, universal framework that leverages off-the-shelf multimodal large language models (MLLMs) to generate rich natural-language video descriptions. These descriptions are embedded with advanced text encoders and combined with collaborative filtering, content-based, and generative recommendation models, bridging the semantic gap between low-level features and users' latent preferences. On the MicroLens-100K dataset, incorporating MLLM-generated descriptions consistently improves performance across five state-of-the-art recommendation models, demonstrating both the effectiveness and the broad applicability of injecting high-level semantic representations into video recommendation.
📝 Abstract
Existing video recommender systems rely primarily on user-defined metadata or on low-level visual and acoustic signals extracted by specialised encoders. These low-level features describe what appears on the screen but miss deeper semantics such as intent, humour, and world knowledge that make clips resonate with viewers. For example, is a 30-second clip simply a singer on a rooftop, or an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such distinctions are critical to personalised recommendation yet remain invisible to traditional encoding pipelines. In this paper, we introduce a simple, recommender-agnostic, fine-tuning-free framework that injects high-level semantics into the recommendation pipeline by prompting an off-the-shelf Multimodal Large Language Model (MLLM) to summarise each clip into a rich natural-language description (e.g. "a superhero parody with slapstick fights and orchestral stabs"), bridging the gap between raw content and user intent. We encode the MLLM output with a state-of-the-art text encoder and feed the resulting representations into standard collaborative, content-based, and generative recommenders. On the MicroLens-100K dataset, which emulates user interactions with TikTok-style videos, our framework consistently surpasses conventional video, audio, and metadata features across five representative models. Our findings highlight the promise of leveraging MLLMs as on-the-fly knowledge extractors for building more intent-aware video recommenders.
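To make the pipeline concrete, here is a minimal illustrative sketch of the three stages described above: prompting an off-the-shelf MLLM for a clip description, encoding that description with a text encoder, and scoring items in a downstream recommender. The prompt, the stubbed `describe_clip_with_mllm` helper, the choice of `sentence-transformers` encoder, and the stand-in user vector are assumptions for illustration only, not the paper's exact implementation.

```python
# Minimal sketch of the zero-fine-tuning pipeline (all names and the prompt
# are illustrative assumptions, not the authors' implementation).
# Requires: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer


def describe_clip_with_mllm(video_path: str) -> str:
    """Placeholder for prompting an off-the-shelf MLLM with a video clip.

    In practice, sampled frames (and optionally the audio track) would be sent
    to a multimodal LLM with a prompt such as:
    'Describe this short video, including its intent, humour, and context.'
    """
    return ("A superhero parody with slapstick fights and orchestral stabs, "
            "filmed amid the fairy chimneys of Cappadocia, Turkey.")


# 1) Generate a rich natural-language description per clip (no fine-tuning).
video_ids = ["clip_001", "clip_002"]
descriptions = [describe_clip_with_mllm(v) for v in video_ids]

# 2) Encode descriptions with a strong text encoder to obtain item features.
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any strong encoder
item_features = text_encoder.encode(descriptions)        # (num_items, dim)

# 3) Feed the features into a downstream recommender; here, a stand-in user
#    embedding (in practice learned by a collaborative/content-based model)
#    is scored against the MLLM-derived item embeddings by dot product.
rng = np.random.default_rng(0)
user_vector = rng.normal(size=item_features.shape[1])
scores = item_features @ user_vector
ranking = [video_ids[i] for i in np.argsort(-scores)]
print(ranking)
```

In the full framework, step 3 would be replaced by whichever recommender is in use (collaborative, content-based, or generative); the MLLM-derived text embeddings simply take the place of, or are fused with, the conventional video, audio, and metadata features.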