🤖 AI Summary
Existing long-video recommendation methods are largely confined to unimodal approaches or simplistic multimodal fusion, lacking a unified, scalable, and reproducible multimodal movie recommendation benchmark.
Method: We introduce the first open-source, reproducible multimodal movie recommendation benchmark supporting joint modeling of visual, audio, and textual modalities. Our LLM-enhanced framework leverages LLaMA-2 to generate high-quality movie summaries—alleviating sparse metadata—and integrates i-vector/CNN/AVF audiovisual encoders with Sentence-T5 for text encoding, enabling early-, middle-, and late-fusion strategies. It further incorporates mainstream recommendation backbones including Matrix Factorization (MF) and Variational Autoencoder-based Collaborative Filtering (VAECF).
Results: Experiments demonstrate that LLM-generated textual features combined with strong text embeddings significantly improve cold-start performance (+18.7%) and coverage (+22.3%). The project releases open-source code, precomputed multimodal embeddings, and configuration files to foster fair and reproducible multimodal recommendation research.
📝 Abstract
Recommending long-form video content demands joint modeling of visual, audio, and textual modalities, yet most benchmarks address only raw features or narrow fusion. We present ViLLA-MMBench, a reproducible, extensible benchmark for LLM-augmented multimodal movie recommendation. Built on MovieLens and MMTF-14K, it aligns dense item embeddings from three modalities: audio (block-level, i-vector), visual (CNN, AVF), and text. Missing or sparse metadata is automatically enriched using state-of-the-art LLMs (e.g., OpenAI Ada), generating high-quality synopses for thousands of movies. All text (raw or augmented) is embedded with configurable encoders (Ada, LLaMA-2, Sentence-T5), producing multiple ready-to-use sets. The pipeline supports interchangeable early-, mid-, and late-fusion (concatenation, PCA, CCA, rank-aggregation) and multiple backbones (MF, VAECF, VBPR, AMR, VMF) for ablation. Experiments are fully declarative via a single YAML file. Evaluation spans accuracy (Recall, nDCG) and beyond-accuracy metrics: cold-start rate, coverage, novelty, diversity, fairness. Results show LLM-based augmentation and strong text embeddings boost cold-start and coverage, especially when fused with audio-visual features. Systematic benchmarking reveals universal versus backbone- or metric-specific combinations. Open-source code, embeddings, and configs enable reproducible, fair multimodal RS research and advance principled generative AI integration in large-scale recommendation. Code: https://recsys-lab.github.io/ViLLA-MMBench