🤖 AI Summary
This work addresses "Subtask 2: Commercial/Ad Memorability" of the memorability task at MediaEval 2025. Methodologically, it proposes a multimodal memorability-prediction framework built on the Gemma-3 large language model (LLM). Visual features are extracted from ad video frames with a Vision Transformer (ViT), textual features from ad scripts are encoded with E5, and both modalities are mapped into the LLM input space through learnable projection layers. Expert-derived memorability dimensions, such as emotional intensity and narrative clarity, are used to construct structured rationale prompts that guide semantic-aware cross-modal fusion. The model is fine-tuned efficiently with Low-Rank Adaptation (LoRA). Compared to a heavily tuned gradient-boosted tree baseline, the LLM-based fusion system shows greater robustness and generalization on the final test set, while offering interpretable insight into the cognitive drivers of ad memorability.
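The fusion scheme described above can be sketched in a few lines. This is an illustrative toy example, not the authors' code: pre-computed ViT frame features and a pooled E5 script embedding are mapped into a shared LLM embedding width by learnable linear projections and stacked as a prefix of "soft tokens" for the language model. All dimensions and the random initialisation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed feature widths: ViT patch/frame features, E5 sentence embedding,
# and the LLM hidden size (hypothetical values for illustration).
D_VIT, D_E5, D_LLM = 768, 1024, 2048

# Learnable projection matrices (randomly initialised here; in the real
# system these would be trained jointly with the LoRA adapters).
W_vis = rng.standard_normal((D_VIT, D_LLM)) * 0.02
W_txt = rng.standard_normal((D_E5, D_LLM)) * 0.02

def fuse(frame_feats: np.ndarray, script_feat: np.ndarray) -> np.ndarray:
    """Project each modality and stack the results as LLM prefix tokens."""
    vis_tokens = frame_feats @ W_vis           # (n_frames, D_LLM)
    txt_token = (script_feat @ W_txt)[None]    # (1, D_LLM)
    return np.concatenate([vis_tokens, txt_token], axis=0)

frames = rng.standard_normal((8, D_VIT))       # 8 sampled ad frames
script = rng.standard_normal(D_E5)             # one pooled script embedding
prefix = fuse(frames, script)
print(prefix.shape)                            # (9, 2048): 8 visual + 1 text token
```

In the full system, this prefix would be prepended to the tokenized rationale prompt before being passed to Gemma-3.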
📝 Abstract
This paper addresses the prediction of commercial (brand) memorability as part of "Subtask 2: Commercial/Ad Memorability" within the "Memorability: Predicting movie and commercial memorability" task at the MediaEval 2025 workshop. We propose a multimodal fusion system with a Gemma-3 LLM backbone that integrates pre-computed visual (ViT) and textual (E5) features through learnable multimodal projection layers. The model is adapted using Low-Rank Adaptation (LoRA). A heavily tuned ensemble of gradient-boosted trees serves as a baseline. A key contribution is the use of LLM-generated rationale prompts, grounded in expert-derived aspects of memorability, to guide the fusion model. The results demonstrate that the LLM-based system exhibits greater robustness and generalization performance on the final test set than the baseline.
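The LoRA adaptation mentioned in the abstract can be summarised with a minimal sketch. This is a generic illustration of the technique, not the paper's implementation: a frozen weight matrix W receives a trainable low-rank update B·A, scaled by alpha/r. The rank r, scaling alpha, and layer sizes below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

d_out, d_in, r, alpha = 512, 512, 8, 16        # hypothetical layer sizes
W = rng.standard_normal((d_out, d_in))         # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01      # trainable rank-r factor
B = np.zeros((d_out, r))                       # trainable, initialised to zero

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = W x + (alpha / r) * B (A x); only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialised to zero, the adapted layer starts identical to the base model.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters per layer: r * (d_in + d_out) vs d_in * d_out for full
# fine-tuning, which is the source of LoRA's efficiency.
print(r * (d_in + d_out), "vs", d_out * d_in)  # 8192 vs 262144
```

The efficiency gain is why LoRA makes fine-tuning a model of Gemma-3's size practical on modest hardware.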
The paper's codebase is available at https://github.com/dsgt-arc/mediaeval-2025-memorability