LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the “Ad Memorability Prediction” subtask of MediaEval 2025. It proposes a multimodal memorability modeling framework built on the Gemma-3 large language model (LLM). Visual features are extracted from ad video frames with a Vision Transformer (ViT), while textual features from ad scripts are encoded with E5; both modalities are aligned through learnable projection layers before being fed into Gemma-3. Crucially, expert-defined memorability dimensions, such as emotional intensity and narrative clarity, are incorporated into structured reasoning prompts, enabling semantics-aware cross-modal feature fusion. The model is efficiently fine-tuned with Low-Rank Adaptation (LoRA). Compared to strong baselines, including gradient-boosted trees, the LLM-based fusion system achieves significantly improved prediction accuracy on the test set. It also enhances robustness, generalizability, and interpretability in cross-modal understanding, offering principled insight into the cognitive drivers of ad memorability.
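The alignment step described above, projecting per-frame ViT features and per-sentence E5 features into the LLM's embedding space before concatenation, can be sketched as follows. All dimensions, the function name `fuse`, and the random initialization are illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed feature dimensions (illustrative, not from the paper):
VIT_DIM, E5_DIM, LLM_DIM = 768, 1024, 2048

# Learnable projection layers, here stand-in random matrices.
W_vis = rng.normal(0, 0.02, (VIT_DIM, LLM_DIM))
W_txt = rng.normal(0, 0.02, (E5_DIM, LLM_DIM))

def fuse(vit_frames: np.ndarray, e5_script: np.ndarray) -> np.ndarray:
    """Project each modality into the LLM embedding space and
    concatenate along the sequence axis as soft-prompt tokens."""
    vis_tokens = vit_frames @ W_vis   # (n_frames,    LLM_DIM)
    txt_tokens = e5_script @ W_txt    # (n_sentences, LLM_DIM)
    return np.concatenate([vis_tokens, txt_tokens], axis=0)

# 8 sampled video frames, 4 script sentences -> 12 soft tokens.
tokens = fuse(rng.normal(size=(8, VIT_DIM)), rng.normal(size=(4, E5_DIM)))
print(tokens.shape)  # (12, 2048)
```

In the actual system these projected tokens would be interleaved with the structured reasoning prompt before being fed to Gemma-3; the sketch only shows the shape bookkeeping of the projection-and-concatenate step.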

📝 Abstract
This paper addresses the prediction of commercial (brand) memorability as part of "Subtask 2: Commercial/Ad Memorability" within the "Memorability: Predicting movie and commercial memorability" task at the MediaEval 2025 workshop competition. We propose a multimodal fusion system with a Gemma-3 LLM backbone that integrates pre-computed visual (ViT) and textual (E5) features via learnable multimodal projections. The model is adapted using Low-Rank Adaptation (LoRA). A heavily tuned ensemble of gradient-boosted trees serves as a baseline. A key contribution is the use of LLM-generated rationale prompts, grounded in expert-derived aspects of memorability, to guide the fusion model. The results demonstrate that the LLM-based system exhibits greater robustness and generalization performance than the baseline on the final test set. The paper's codebase is available at https://github.com/dsgt-arc/mediaeval-2025-memorability
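LoRA, mentioned in the abstract, adapts a frozen weight matrix W by learning a low-rank update scaled by alpha/r. A minimal numerical sketch of the forward pass (sizes, rank, and scaling here are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 512, 8, 16      # hidden size, LoRA rank, scaling (illustrative)
W = rng.normal(size=(d, d))   # frozen pretrained weight

# LoRA factors: B starts at zero so the adapted layer initially
# matches the frozen one; only A and B receive gradient updates.
A = rng.normal(0, 0.01, (r, d))
B = np.zeros((d, r))

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = x W^T + (alpha/r) * x A^T B^T : frozen path plus low-rank update."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(3, d))
# With B = 0 the adapter is a no-op, matching the start of fine-tuning.
assert np.allclose(lora_forward(x), x @ W.T)
```

The appeal for this task is parameter efficiency: only the rank-r factors A and B (2*r*d values per adapted matrix) are trained, so the Gemma-3 backbone can be specialized to memorability prediction without updating its full weights.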
Problem

Research questions and friction points this paper is trying to address.

Predicting commercial memorability via multimodal feature fusion
Integrating visual and textual features with an LLM backbone
Enhancing robustness through LLM-generated rationale prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal fusion system with a Gemma-3 LLM backbone
Integration of visual and textual features via learnable projections
LLM-generated rationale prompts to guide the fusion model