🤖 AI Summary
Current large vision-language models (LVLMs) show limited ability to understand context-dependent meme intent: they often disregard conversational cues or fixate on visual details, and consequently fail to identify a meme's communicative purpose. To address this, we introduce MemeReaCon, a novel context-aware benchmark for meme understanding, specifically designed to evaluate how meme intent shifts across distinct Reddit discussion contexts. We construct a real-world multimodal dataset from five Reddit communities, comprising original posts, threaded comments, and community engagement signals, and formally define the task of context-dependent meme understanding. Our annotation schema covers four fine-grained dimensions: image-text interplay, poster intent, meme structure, and community response. Using human-in-the-loop annotation and a context-sensitive evaluation protocol spanning intent recognition, image-text alignment, and community-aware modeling, we expose systematic deficiencies in LVLMs' cross-modal contextual integration. MemeReaCon provides a reproducible, quantitative foundation for diagnosing meme comprehension capabilities and guiding model improvement.
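For concreteness, the sketch below shows one way a single benchmark instance could be represented in Python. The field names (e.g. `post_text`, `community_response`) are illustrative assumptions, not the released schema; only the grouping of image, post text, comments, engagement signals, and the four annotation dimensions comes from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MemeInstance:
    """One hypothetical MemeReaCon example: a meme kept together with
    its original Reddit context. Field names are illustrative only."""
    image_path: str            # the meme image
    post_text: str             # title/body text of the original post
    comments: List[str]        # threaded user comments (flattened here)
    upvotes: int               # community engagement signal
    subreddit: str             # one of the five source communities
    # The four annotation dimensions described in the paper:
    image_text_interplay: str  # how the text and meme work together
    poster_intent: str         # what the poster intended
    meme_structure: str        # how the meme is structured
    community_response: str    # how the community responded
```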
📝 Abstract
Memes have emerged as a popular form of multimodal online communication, and their interpretation heavily depends on the specific context in which they appear. Current approaches predominantly focus on isolated meme analysis, either for harmful content detection or standalone interpretation, overlooking a fundamental challenge: the same meme can express different intents depending on its conversational context. This oversight creates an evaluation gap: although humans intuitively recognize how context shapes meme interpretation, Large Vision Language Models (LVLMs) struggle to understand context-dependent meme intent. To address this critical limitation, we introduce MemeReaCon, a novel benchmark specifically designed to evaluate how LVLMs understand memes in their original context. We collected memes from five different Reddit communities, keeping each meme's image, post text, and user comments together. We carefully labeled how the text and meme work together, what the poster intended, how the meme is structured, and how the community responded. Our tests with leading LVLMs reveal a clear weakness: models either fail to interpret critical information in the context, or focus so heavily on visual details that they overlook the communicative purpose. MemeReaCon thus serves both as a diagnostic tool that exposes current limitations and as a challenging benchmark to drive development toward LVLMs capable of context-aware understanding.
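One plausible way to probe the context dependence the abstract describes is to query an LVLM on the same meme with and without its conversational context and measure how often the predicted intent changes. The sketch below is a minimal, hypothetical diagnostic, not the paper's actual evaluation protocol; it reuses the illustrative `MemeInstance` dataclass from the sketch above and assumes a generic `predict_intent` inference callable standing in for any LVLM.

```python
from typing import Callable, List

def context_sensitivity(
    instances: List[MemeInstance],
    predict_intent: Callable[[str, str], str],  # (image_path, prompt) -> intent label
) -> float:
    """Fraction of examples where adding the conversational context
    changes the model's predicted intent (hypothetical diagnostic)."""
    changed = 0
    for ex in instances:
        # Context-free query: the meme image alone.
        bare = predict_intent(ex.image_path, "What is the intent of this meme?")
        # Context-aware query: prepend the post text and comments.
        prompt = (
            f"Post: {ex.post_text}\n"
            f"Comments: {' | '.join(ex.comments)}\n"
            "Given this discussion, what is the intent of this meme?"
        )
        contextual = predict_intent(ex.image_path, prompt)
        changed += bare != contextual
    return changed / len(instances) if instances else 0.0
```

A model that genuinely integrates conversational context should shift its prediction on context-dependent memes while remaining stable when the context is uninformative; comparing both rates against human annotations would separate sensitivity from noise.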