AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking

📅 2026-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the significant limitations of current AI models in comprehending the cultural context and emotional nuances of text-free audiovisual memes. To bridge this gap, the authors construct a multimodal benchmark comprising over a thousand iconic audiovisual memes—including speech, songs, music, and sound effects—curated with human annotation and enriched with structured metadata such as year, transcription, summary, and sensitivity tags. The work introduces a hierarchical question-answering framework that progresses from surface-level content to deep cultural interpretation. Notably, it presents the first systematic evaluation framework integrating cultural, contextual, and affective dimensions into multimodal meme understanding, supporting fine-grained assessment across languages, cultures, and modalities. Experimental results reveal that state-of-the-art multimodal large language models substantially underperform humans in text-free audio comprehension and cultural reasoning, highlighting a critical gap in cultural alignment.

📝 Abstract
Internet audio-visual clips convey meaning through time-varying sound and motion, which extend beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with a unique Q&A assessing levels of understanding, from surface content, to context and emotion, to usage and world knowledge, along with metadata such as original year, transcript, summary, and sensitivity. We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants using this benchmark. Our results reveal a consistent limitation: current models perform poorly on textless music and sound effects, and struggle to think in context and in culture compared to surface content. These findings highlight a key gap in human-aligned multimodal intelligence and call for models that can perceive contextually and culturally beyond the surface of what they hear and see. Project page: avmemeexam.github.io/public
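As a rough illustration of the benchmark structure the abstract describes (one meme per entry, metadata such as original year, transcript, summary, and sensitivity, and questions tiered from surface content to context/emotion to usage/world knowledge), the sketch below models one possible entry layout. All field names, the QALevel tiers, and the accuracy_by_level helper are assumptions made for illustration, not the released AVMeme Exam schema.

```python
# Minimal sketch of how an AVMeme Exam entry *might* be organized, based only on
# the metadata and question levels described in the abstract. Field names and the
# scoring helper are illustrative assumptions, not the paper's released format.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class QALevel(Enum):
    SURFACE_CONTENT = "surface_content"      # what is literally heard / seen
    CONTEXT_EMOTION = "context_emotion"      # where it comes from, what it evokes
    USAGE_WORLD_KNOWLEDGE = "usage_world"    # how people use it, cultural references


@dataclass
class MemeQA:
    level: QALevel
    question: str
    choices: List[str]
    answer_index: int


@dataclass
class AVMemeEntry:
    meme_id: str
    modality: str              # e.g. "speech", "song", "music", "sound_effect"
    language: Optional[str]    # None for textless music or sound effects
    year: int                  # original year the clip became iconic
    transcript: Optional[str]  # None when the clip has no words
    summary: str
    sensitive: bool
    qas: List[MemeQA] = field(default_factory=list)


def accuracy_by_level(
    entries: List[AVMemeEntry],
    predictions: Dict[Tuple[str, str], int],
) -> Dict[str, float]:
    """Per-level accuracy, given predictions keyed by (meme_id, level value)."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for entry in entries:
        for qa in entry.qas:
            key = (entry.meme_id, qa.level.value)
            if key not in predictions:
                continue
            total[qa.level.value] = total.get(qa.level.value, 0) + 1
            if predictions[key] == qa.answer_index:
                correct[qa.level.value] = correct.get(qa.level.value, 0) + 1
    return {lvl: correct.get(lvl, 0) / n for lvl, n in total.items()}
```

Reporting accuracy per question level, rather than a single aggregate, matches the paper's framing that models degrade as questions move from surface content toward contextual and cultural understanding.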
Problem

Research questions and friction points this paper is trying to address.

multimodal
cultural understanding
contextual reasoning
audio-visual memes
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal benchmark
cultural understanding
audio-visual memes
contextual reasoning
multilingual evaluation
Xilin Jiang
PhD student, Columbia University
Speech and Audio, Machine Listening, Machine Perception, Multimodal LLM, Brain-Computer Interface
Qiaolin Wang
Columbia University
Junkai Wu
University of Washington
speech processing, audio processing
Xiaomin He
Columbia University
Zhongweiyang Xu
University of Illinois Urbana-Champaign
Generative Model, Array Signal Processing, Speech Processing
Yinghao Ma
PhD candidate, Centre for Digital Music (C4DM), Queen Mary University of London
Music Information Retrieval, Large Language Models, Multimodal Learning, Audio Signal Processing
Minshuo Piao
Johns Hopkins University
Kaiyi Yang
Johns Hopkins University
Xiuwen Zheng
Johns Hopkins University
Riki Shimizu
Columbia University
Yicong Chen
University of Washington
Arsalan Firoozi
Columbia University
Speech Neuroscience
Gavin Mischler
PhD Student at Columbia University
Computational Neuroscience, Computational Medicine, Neurolinguistics, Machine Learning
Sukru Samet Dindar
Columbia University
Brain-Computer Interfaces, Audio and Speech, Large Language Models, Auditory Neuroscience
Richard J. Antonello
Johns Hopkins University
Linyang He
Johns Hopkins University
Tsun-An Hsieh
PhD Student at UIUC
Deep Learning, Speech Processing
Xulin Fan
University of Illinois at Urbana-Champaign
Machine Learning, Speech Processing
Yulun Wu
Johns Hopkins University
Yuesheng Ma
Johns Hopkins University
Chaitanya Amballa
Johns Hopkins University
Weixiong Chen
Johns Hopkins University
Jiarui Hai
Johns Hopkins University
computer audition, generative models, music information retrieval
Ruisi Li
Johns Hopkins University
Vishal Choudhari
Electrical Engineering Ph.D. Candidate, Columbia University
Multimodal Systems, Large Language Models, Speech and Audio, Brain-Computer Interfaces
Cong Han
Google, Columbia University
Audio and speech, Brain-computer interface
Yinghao Aaron Li
PhD Student, Columbia University
Computational Neuroscience, Voice Conversion, Speech Synthesis
A. Flinker
Columbia University
Mounya Elhilali
Professor of Electrical and Computer Engineering, The Johns Hopkins University
Emmanouil Benetos
Queen Mary University of London
Machine listening, Audio signal processing, Music information retrieval, Machine learning
M. Hasegawa-Johnson
University of Illinois Urbana-Champaign
Romit Roy Choudhury
Professor of ECE and CS, University of Illinois at Urbana-Champaign (UIUC)
Wireless networking, mobile computing, sensing, signal processing
N. Mesgarani
Columbia University