🤖 AI Summary
This study addresses the challenge of narrative role identification (e.g., hero, villain, victim) in internet memes, with particular emphasis on culturally sensitive “victim” classification and cross-lingual generalization in English–Hindi code-mixed contexts. Methodologically, we introduce the first multilingual, multimodal meme role dataset enriched with cultural annotations, systematically capturing authentic linguistic and visual features of memes. We benchmark a wide spectrum of models (fine-tuned multilingual DeBERTa-v3, sentiment- and abuse-aware classifiers, instruction-tuned LLMs, and vision-language models such as Qwen2.5-VL) under zero-shot settings, and propose a hybrid structured prompting strategy to guide the multimodal models. Results show that Qwen2.5-VL and DeBERTa-v3 achieve the highest accuracy and that hybrid prompting yields small but consistent gains in robustness, yet victim identification and generalization to code-switched content remain challenging. Our core contribution lies in advancing narrative understanding from synthetic benchmarks toward real-world, culturally embedded, and linguistically diverse meme ecosystems.
📝 Abstract
This work investigates the challenging task of identifying narrative roles (Hero, Villain, Victim, and Other) in Internet memes, across three diverse test sets spanning English and code-mixed (English-Hindi) languages. Building on an annotated dataset skewed toward the 'Other' class, we explore a more balanced and linguistically diverse extension introduced as part of the CLEF 2024 shared task. Comprehensive lexical and structural analyses highlight the nuanced, culture-specific, and context-rich language used in real memes, in contrast to synthetically curated hateful content, which exhibits explicit and repetitive lexical markers. To benchmark the role detection task, we evaluate a wide spectrum of models, including fine-tuned multilingual transformers, sentiment- and abuse-aware classifiers, instruction-tuned LLMs, and multimodal vision-language models. Performance is assessed under zero-shot settings using precision, recall, and F1 metrics. While larger models such as DeBERTa-v3 and Qwen2.5-VL demonstrate notable gains, results reveal consistent challenges in reliably identifying the 'Victim' class and in generalising across cultural and code-mixed content. We also explore prompt design strategies to guide multimodal models and find that hybrid prompts incorporating structured instructions and role definitions offer marginal yet consistent improvements. Our findings underscore the importance of cultural grounding, prompt engineering, and multimodal reasoning in modelling subtle narrative framings in visual-textual content.
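The hybrid prompting idea described above, combining structured task instructions with explicit role definitions, can be sketched as follows. This is a minimal illustration with hypothetical wording and function names, not the exact prompt template or parsing logic used in the paper:

```python
# Hypothetical role definitions; the paper's actual definitions and prompt
# wording are assumptions here, used only to illustrate the hybrid structure.
ROLE_DEFINITIONS = {
    "Hero": "an entity portrayed positively, praised, or glorified",
    "Villain": "an entity blamed, mocked, or portrayed as causing harm",
    "Victim": "an entity portrayed as suffering or being wronged",
    "Other": "an entity that fits none of the roles above",
}

def build_hybrid_prompt(meme_text: str, entity: str) -> str:
    """Compose a zero-shot prompt pairing structured instructions with role definitions."""
    definitions = "\n".join(
        f"- {role}: {desc}" for role, desc in ROLE_DEFINITIONS.items()
    )
    return (
        "Task: Identify the narrative role of the given entity in the meme.\n"
        f"Role definitions:\n{definitions}\n"
        f"Meme text: {meme_text}\n"
        f"Entity: {entity}\n"
        "Answer with exactly one of: Hero, Villain, Victim, Other."
    )

def parse_role(model_output: str) -> str:
    """Map a free-form model response onto one of the four role labels."""
    lowered = model_output.lower()
    for role in ("Hero", "Villain", "Victim"):
        if role.lower() in lowered:
            return role
    return "Other"  # default when no explicit role is mentioned
```

The prompt string produced by `build_hybrid_prompt` would be passed to a vision-language model alongside the meme image; `parse_role` then normalises the response to one of the four labels so that precision, recall, and F1 can be computed.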