WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

📅 2024-10-16
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) show significant limitations in non-English contexts and in understanding marginalized cultural knowledge. Method: We introduce the largest multilingual, multicultural visual question answering benchmark to date, covering 30 languages and dialects across 9 language families with over 1 million image-text pairs, centered on dish-name recognition and geographic-provenance inference. The benchmark provides fine-grained regional annotations, cross-cultural image labeling, location-aware prompting, and adversarial-context evaluation. Contribution/Results: This work presents the first systematic assessment of VLMs' cross-lingual and cross-regional comprehension of global culinary culture. Results show that VLMs perform better when given correct location context, but they remain weak in adversarial settings and at predicting specific regional cuisines and languages. We publicly release the benchmark, including 12k and 60k evaluation splits and a 1M-instance training set, alongside an annotated culinary knowledge base and an adversarial test suite, establishing a new standard for culturally equitable VLM evaluation.

📝 Abstract
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.
Problem

Research questions and friction points this paper is trying to address.

Evaluates VLMs on culture-specific multilingual knowledge
Assesses VLM performance in underrepresented cultural contexts
Challenges VLMs with regional cuisines and language predictions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual VQA dataset
30 languages and dialects
1 million data points
👥 Authors
Genta Indra Winata
Capital One AI Foundations
Multilinguality, Language Modeling, Multimodal, Low-resource NLP, Code-Switching
Frederikus Hudi
Nara Institute of Science and Technology
Machine Translation, Multilinguality, Low-Resource NLP
Patrick Amadeus Irawan
MBZUAI, SMU
Natural Language Processing, Vision Language, Multimodality, Interpretability
David Anugraha
Stanford University
Machine Learning, Natural Language Processing, Multimodality, Artificial Intelligence
Rifki Afina Putri
SEACrowd, MBZUAI
Yutong Wang
JAIST
Adam Nohejl
Unknown affiliation
Natural Language Processing, Computational Psycholinguistics, Lexical Simplification
Ubaidillah Ariq Prathama
ITB
N. Ousidhoum
Cardiff University
Afifa Amriani
Independent
Anar Y. Rzayev
Independent
Anirban Das
Assistant Professor, BITS Pilani
Internet of Things, Cyber Physical Systems, Sentiment Analysis
Ashmari Pramodya
NAIST
Aulia Adila
JAIST
Bryan Wilie
Ph.D. Candidate, Hong Kong University of Science & Technology
Reasoning, Multilingualism, Retrieval-Augmented Generation, Agentic AI
Candy Olivia Mawalim
JAIST
Ching Lam Cheng
SMU
D. Abolade
Masakhane, University of Lagos
Emmanuele Chersoni
Hong Kong Polytechnic University
Computational Linguistics
Enrico Santus
Bloomberg, CTO AI
Natural Language Processing for Finance
Fariz Ikhwantri
Independent
Garry Kuwanto
Boston University
Hanyang Zhao
Columbia University
Haryo Akbarianto Wibowo
MBZUAI
Natural Language Processing
Holy Lovenia
SEACrowd
Multimodal & multilingual
Jan Christian Blaise Cruz
MBZUAI, McGill University, Mila - Quebec AI Institute
Natural Language Processing, Translation, Multilinguality, Low-resource Languages, Code Switching
Jan Wira Gotama Putra
Independent
Junho Myung
KAIST
NLP, HCI
Lucky Susanto
Monash University Indonesia
Natural Language Processing, Machine Learning, Neural Machine Translation, Low-Resource Settings
Maria Angelica Riera Machin
NAIST
Marina Zhukova
UCSB
Michael Anugraha
Independent
Muhammad Farid Adilazuarda
SEACrowd, MBZUAI
Natasha Santosa
Tokyo Tech
Peerat Limkonchotiwat
Research Fellow, AI Singapore, National University of Singapore
Evaluation and Benchmark, Representation Learning, Large Language Model, Multilingual Learning
Raj Dabre
Researcher@NICT (Japan), Adjunct Faculty@IIT Madras/AI4Bharat (India)
Artificial Intelligence, Machine Translation, Natural Language Processing, Genetics
Rio Alexander Audino
ITB
Samuel Cahyawijaya
Cohere
Low-Resource NLP, Underrepresented Languages, Multilingual, Cross-lingual, Zero/Few-shot Learning
Shi-Xiong Zhang
Capital One
Stephanie Yulia Salim
JAIST
Yi Zhou
Cardiff University
Yinxuan Gui
SMU
D. Adelani
Masakhane, McGill, MILA
En-Shiun Annie Lee
Ontario Tech University, and University of Toronto (Status-Only)
Natural Language Processing, Data Mining, Pattern Analysis
Shogo Okada
JAIST
Ayu Purwarianti
Associate Professor, Informatics, Institut Teknologi Bandung, Indonesia
Computational Linguistics, Machine Learning
Alham Fikri Aji
MBZUAI, Monash Indonesia
Multilinguality, Low-resource NLP, Language Modeling, Machine Translation
Taro Watanabe
Nara Institute of Science and Technology
Machine Translation, Machine Learning
Derry Tanti Wijaya
Boston University, Monash University Indonesia
Natural Language Processing, Machine Learning, Information Extraction
Alice Oh
KAIST Computer Science
Machine Learning, NLP, Computational Social Science
Chong-Wah Ngo
Singapore Management University
Multimedia, Food Computing, Computer Vision, Information Retrieval