🤖 AI Summary
This work addresses the open question of how well multimodal large language models (MLLMs) interpret figurative meaning conveyed through the interplay of text and image in internet memes. To this end, the authors evaluate eight state-of-the-art MLLMs across three datasets covering six types of figurative meaning, and introduce a human evaluation protocol to assess the logical coherence and content fidelity of model-generated explanations. The evaluation reveals a prevalent tendency to misclassify non-figurative memes as figurative; even when predictions are correct, the generated explanations often diverge from the original meme content. The study offers critical insights into the challenges of multimodal figurative-meaning comprehension.
📝 Abstract
Internet memes are a popular form of multimodal online communication and often use figurative elements to convey layered meaning through the combination of text and images. However, it remains largely unclear how multimodal large language models (MLLMs) combine and interpret visual and textual information to identify figurative meaning in memes. To address this gap, we evaluate eight state-of-the-art generative MLLMs across three datasets on their ability to detect and explain six types of figurative meaning. In addition, we conduct a human evaluation of the explanations generated by these MLLMs, assessing whether the provided reasoning supports the predicted label and whether it remains faithful to the original meme content. Our findings indicate that all models exhibit a strong bias toward associating a meme with figurative meaning, even when no such meaning is present. Qualitative analysis further shows that correct predictions are not always accompanied by faithful explanations.