🤖 AI Summary
This study addresses the limitations of current AI models in comprehending the cultural context and emotional nuance of text-free audiovisual memes. To bridge this gap, the authors construct a multimodal benchmark of over a thousand iconic audiovisual memes spanning speech, songs, music, and sound effects, curated and annotated by humans and enriched with structured metadata such as year, transcription, summary, and sensitivity tags. The work introduces a hierarchical question-answering framework that progresses from surface-level content to deep cultural interpretation. Notably, it presents the first systematic evaluation framework to integrate cultural, contextual, and affective dimensions into multimodal meme understanding, supporting fine-grained assessment across languages, cultures, and modalities. Experimental results show that state-of-the-art multimodal large language models substantially underperform humans on text-free audio comprehension and cultural reasoning, exposing a critical gap in cultural alignment.
📝 Abstract
Internet audio-visual clips convey meaning through time-varying sound and motion, reaching beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with a unique Q&A that assesses successive levels of understanding, from surface content, to context and emotion, to usage and world knowledge, along with metadata such as original year, transcript, summary, and sensitivity. We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants on this benchmark. Our results reveal a consistent limitation: current models perform poorly on textless music and sound effects, and they struggle far more with contextual and cultural reasoning than with surface content. These findings highlight a key gap in human-aligned multimodal intelligence and call for models that can perceive contextually and culturally, beyond the surface of what they hear and see. Project page: avmemeexam.github.io/public
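To make the described entry structure concrete, below is a minimal Python sketch of what a single benchmark item might look like. The field names (meme_id, modality, qa, etc.) and the example values are illustrative assumptions based on the metadata and question levels mentioned above, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sketch of one AVMeme Exam entry; field names are assumptions
# for illustration, not the released dataset's schema.
@dataclass
class MemeEntry:
    meme_id: str            # unique identifier for the clip
    modality: str           # "speech" | "song" | "music" | "sound_effect"
    year: int               # year the meme originated
    transcript: str         # transcription (empty for textless audio)
    summary: str            # short human-written description
    sensitivity: List[str]  # sensitivity tags, if any
    # Maps a level of understanding (surface content, context/emotion,
    # usage/world knowledge) to a question-answer pair.
    qa: Dict[str, Dict[str, str]] = field(default_factory=dict)

# Toy usage example (invented, not taken from the benchmark).
entry = MemeEntry(
    meme_id="sfx_0001",
    modality="sound_effect",
    year=2007,
    transcript="",
    summary="A short dramatic sting used as a punchline.",
    sensitivity=[],
    qa={
        "surface": {
            "question": "What kind of sound is this?",
            "answer": "A dramatic orchestral sting.",
        },
        "context_emotion": {
            "question": "What mood does this sound usually signal?",
            "answer": "Mock seriousness or sudden shock.",
        },
    },
)
```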