Enhancing Meme Emotion Understanding with Multi-Level Modality Enhancement and Dual-Stage Modal Fusion

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing meme emotion understanding (MEU) approaches suffer from coarse-grained multimodal fusion and insufficient modeling of implicit semantics. To address these limitations, we propose a hierarchical multimodal enhancement framework: (1) a four-step text enhancement module leveraging multimodal large language models (MLLMs) to strengthen textual reasoning; and (2) a two-stage fusion mechanism—comprising shallow cross-modal alignment and deep semantic complementarity—to enable fine-grained image–text co-modeling. This design significantly improves the modeling of implicit emotional cues and background knowledge. Our method achieves state-of-the-art performance, boosting F1 scores by 4.3% on MET-MEME and 3.4% on MOOD. The core contribution lies in the first integration of MLLM-driven textual reasoning with dual-stage feature fusion, advancing MEU along both semantic depth and cross-modal synergy dimensions.

📝 Abstract
With the rapid rise of social media and Internet culture, memes have become a popular medium for expressing emotional tendencies. This has sparked growing interest in Meme Emotion Understanding (MEU), which aims to classify the emotional intent behind memes by leveraging their multimodal content. While existing efforts have achieved promising results, two major challenges remain: (1) a lack of fine-grained multimodal fusion strategies, and (2) insufficient mining of memes' implicit meanings and background knowledge. To address these challenges, we propose MemoDetector, a novel framework for advancing MEU. First, we introduce a four-step textual enhancement module that utilizes the rich knowledge and reasoning capabilities of Multimodal Large Language Models (MLLMs) to progressively infer and extract implicit and contextual insights from memes. These enhanced texts significantly enrich the original meme content and provide valuable guidance for downstream classification. Next, we design a dual-stage modal fusion strategy: the first stage performs shallow fusion on the raw meme image and text, while the second stage deeply integrates the enhanced visual and textual features. This hierarchical fusion enables the model to better capture nuanced cross-modal emotional cues. Experiments on two datasets, MET-MEME and MOOD, demonstrate that our method consistently outperforms state-of-the-art baselines. Specifically, MemoDetector improves F1 scores by 4.3% on MET-MEME and 3.4% on MOOD. Further ablation studies and in-depth analyses validate the effectiveness and robustness of our approach, highlighting its strong potential for advancing MEU. Our code is available at https://github.com/singing-cat/MemoDetector.
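The dual-stage fusion described in the abstract can be sketched as follows. This is a minimal, model-agnostic illustration, not the paper's implementation: the feature dimensions, the use of single-head cross-attention as the fusion operator, and the mean-pooled final representation are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(query, key_value):
    # Single-head scaled dot-product attention, used here as a simple
    # stand-in for cross-modal fusion; the paper's exact operator is
    # not specified in this summary.
    d = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ key_value

# Hypothetical feature shapes: 4 image patches, 6 text tokens, dim 8.
img_raw = rng.standard_normal((4, 8))
txt_raw = rng.standard_normal((6, 8))
img_enh = rng.standard_normal((4, 8))  # enhanced visual features
txt_enh = rng.standard_normal((6, 8))  # MLLM-enhanced textual features

# Stage 1: shallow fusion on the raw meme image and text.
shallow = cross_attention(img_raw, txt_raw)          # (4, 8)

# Stage 2: deep fusion integrating the enhanced features.
deep = cross_attention(shallow + img_enh, txt_enh)   # (4, 8)

# Pool both stages into one meme representation for classification.
meme_repr = np.concatenate([shallow.mean(axis=0), deep.mean(axis=0)])
print(meme_repr.shape)  # (16,)
```

The key point the sketch captures is hierarchy: the second stage consumes the first stage's output together with the enhanced features, rather than fusing everything in one step.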
Problem

Research questions and friction points this paper is trying to address.

Classifying emotional intent in memes using multimodal content analysis
Addressing lack of fine-grained multimodal fusion strategies for memes
Improving mining of implicit meanings and background knowledge in memes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses MLLMs to extract implicit textual insights
Implements dual-stage fusion for cross-modal features
Enhances meme emotion classification with hierarchical integration
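The four-step textual enhancement could be organized as a chained prompting pipeline like the one below. The step wordings and the `ask_mllm` callable are illustrative assumptions; the summary only states that the module progressively infers implicit and contextual insights, not the exact prompts.

```python
# Hypothetical four-step enhancement chain; the actual steps and
# prompts used by MemoDetector are not given in this summary.
STEPS = [
    "Describe the literal visual content of the meme.",
    "Transcribe and interpret the overlaid text.",
    "Infer the implicit meaning or background knowledge.",
    "Summarize the likely emotional intent.",
]

def enhance_text(meme_caption, ask_mllm):
    # `ask_mllm` is any callable (prompt -> answer); injecting it keeps
    # the sketch model-agnostic. Each step sees the accumulated context,
    # so later steps can build on earlier inferences.
    context = meme_caption
    insights = []
    for step in STEPS:
        answer = ask_mllm(f"{step}\nContext so far: {context}")
        insights.append(answer)
        context = f"{context}\n{answer}"
    return insights

# Toy stand-in for an MLLM call: echo the first line of the prompt.
demo = enhance_text("Distracted Boyfriend meme", lambda p: p.splitlines()[0])
print(len(demo))  # 4
```

The enhanced texts returned by such a chain would then be encoded and fed into the second fusion stage alongside the visual features.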