🤖 AI Summary
This work addresses "Subtask 2: Commercial/Ad Memorability" of the memorability task at MediaEval 2025. Methodologically, it proposes a multimodal memorability-prediction framework built on the Gemma-3 large language model (LLM). Visual features are extracted from ad video frames with a Vision Transformer (ViT), textual features from ad scripts are encoded with E5, and both modalities are mapped into the LLM input space through learnable projection layers. Expert-derived memorability dimensions, such as emotional intensity and narrative clarity, are used to construct structured rationale prompts that guide semantic-aware cross-modal fusion. The model is fine-tuned efficiently with Low-Rank Adaptation (LoRA). Compared to a heavily tuned gradient-boosted tree baseline, the LLM-based fusion system shows greater robustness and generalization on the final test set, while offering interpretable insight into the cognitive drivers of ad memorability.
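The fusion scheme described above can be sketched in a few lines. This is an illustrative toy example, not the authors' code: pre-computed ViT frame features and a pooled E5 script embedding are mapped into a shared LLM embedding width by learnable linear projections and stacked as a prefix of "soft tokens" for the language model. All dimensions and the random initialisation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed feature widths: ViT patch/frame features, E5 sentence embedding,
# and the LLM hidden size (hypothetical values for illustration).
D_VIT, D_E5, D_LLM = 768, 1024, 2048

# Learnable projection matrices (randomly initialised here; in the real
# system these would be trained jointly with the LoRA adapters).
W_vis = rng.standard_normal((D_VIT, D_LLM)) * 0.02
W_txt = rng.standard_normal((D_E5, D_LLM)) * 0.02

def fuse(frame_feats: np.ndarray, script_feat: np.ndarray) -> np.ndarray:
    """Project each modality and stack the results as LLM prefix tokens."""
    vis_tokens = frame_feats @ W_vis           # (n_frames, D_LLM)
    txt_token = (script_feat @ W_txt)[None]    # (1, D_LLM)
    return np.concatenate([vis_tokens, txt_token], axis=0)

frames = rng.standard_normal((8, D_VIT))       # 8 sampled ad frames
script = rng.standard_normal(D_E5)             # one pooled script embedding
prefix = fuse(frames, script)
print(prefix.shape)                            # (9, 2048): 8 visual + 1 text token
```

In the full system, this prefix would be prepended to the tokenized rationale prompt before being passed to Gemma-3.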
📝 Abstract
This paper addresses the prediction of commercial (brand) memorability as part of "Subtask 2: Commercial/Ad Memorability" within the "Memorability: Predicting movie and commercial memorability" task at the MediaEval 2025 workshop. We propose a multimodal fusion system with a Gemma-3 LLM backbone that integrates pre-computed visual (ViT) and textual (E5) features through learnable multimodal projection layers. The model is adapted using Low-Rank Adaptation (LoRA). A heavily tuned ensemble of gradient-boosted trees serves as a baseline. A key contribution is the use of LLM-generated rationale prompts, grounded in expert-derived aspects of memorability, to guide the fusion model. The results demonstrate that the LLM-based system exhibits greater robustness and generalization performance on the final test set than the baseline.
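The LoRA adaptation mentioned in the abstract can be summarised with a minimal sketch. This is a generic illustration of the technique, not the paper's implementation: a frozen weight matrix W receives a trainable low-rank update B·A, scaled by alpha/r. The rank r, scaling alpha, and layer sizes below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

d_out, d_in, r, alpha = 512, 512, 8, 16        # hypothetical layer sizes
W = rng.standard_normal((d_out, d_in))         # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01      # trainable rank-r factor
B = np.zeros((d_out, r))                       # trainable, initialised to zero

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = W x + (alpha / r) * B (A x); only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialised to zero, the adapted layer starts identical to the base model.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters per layer: r * (d_in + d_out) vs d_in * d_out for full
# fine-tuning, which is the source of LoRA's efficiency.
print(r * (d_in + d_out), "vs", d_out * d_in)  # 8192 vs 262144
```

The efficiency gain is why LoRA makes fine-tuning a model of Gemma-3's size practical on modest hardware.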
The paper's codebase is available at https://github.com/dsgt-arc/mediaeval-2025-memorability