🤖 AI Summary
To address metadata sparsity in movie recommendation, which leads to cold-start challenges and limited recommendation novelty, this paper proposes a unified framework that integrates generative text augmentation with multimodal vision–language understanding. Methodologically, it (i) employs large language models (LLMs) to generate high-quality plot descriptions that compensate for missing textual metadata; (ii) extracts visual embeddings from trailer frames and aligns them with textual features via canonical correlation analysis (CCA) for dimensionality reduction and cross-modal fusion; and (iii) enhances ranking through a hybrid retrieval-augmented generation (RAG) and collaborative filtering pipeline with an LLM-based re-ranking module. Experiments show that CCA-based fusion significantly improves recall, while LLM re-ranking substantially boosts NDCG@10 (+12.3%) under text-constrained conditions. The framework is particularly effective in cold-start and long-tail recommendation scenarios, and the implementation is open-sourced.
📝 Abstract
This paper addresses the challenge of developing multimodal recommender systems for the movie domain, where limited metadata (e.g., title, genre) often hinders the generation of robust recommendations. We introduce a resource that combines LLM-generated plot descriptions with trailer-derived visual embeddings in a unified pipeline supporting both Retrieval-Augmented Generation (RAG) and collaborative filtering. Central to our approach is a data augmentation step that transforms sparse metadata into richer textual signals, alongside fusion strategies (e.g., PCA, CCA) that integrate visual cues. Experimental evaluations demonstrate that CCA-based fusion significantly boosts recall compared to unimodal baselines, while an LLM-driven re-ranking step further improves NDCG, particularly in scenarios with limited textual data. By releasing this framework, we invite further exploration of multimodal recommendation techniques tailored to cold-start, novelty-focused, and domain-specific settings. All code, data, and detailed documentation are publicly available at https://github.com/RecSys-lab/RAG-VisualRec
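The retrieve-then-re-rank flow described in the abstract can be sketched as a two-stage loop. Everything here is illustrative: the `llm_score` callable is a hypothetical stand-in for an LLM relevance judge, not the paper's actual interface, and the embeddings are random placeholders:

```python
import numpy as np

def retrieve(query_emb, item_embs, k=5):
    # Stage 1: dense retrieval of top-k candidates by cosine similarity.
    sims = item_embs @ query_emb / (
        np.linalg.norm(item_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9
    )
    return np.argsort(-sims)[:k]

def rerank(candidates, llm_score):
    # Stage 2: re-order retrieved candidates by an externally supplied
    # relevance score (in the paper, an LLM judging user/item fit).
    return sorted(candidates, key=lambda i: -llm_score(i))

rng = np.random.default_rng(1)
items = rng.normal(size=(50, 16))   # placeholder fused item embeddings
query = rng.normal(size=16)         # placeholder user/query embedding

cands = retrieve(query, items, k=5)
# Random scores standing in for hypothetical LLM judgments.
scores = {i: float(s) for i, s in zip(cands, rng.random(5))}
ranked = rerank(list(cands), scores.get)
```

The design point is that the expensive LLM call touches only the small retrieved candidate set, so re-ranking cost stays constant as the catalog grows.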