Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

📅 2025-03-10
🤖 AI Summary
Southeast Asia (SEA) remains severely underrepresented in vision-language (VL) research, hindering AI models’ ability to capture its rich cultural and linguistic diversity. To address this, we introduce SEA-VL, the first open-source VL dataset and systematic construction framework tailored to SEA. Our methodology integrates localized crowdsourcing, culturally aware web crawling, and generative image exploration via Stable Diffusion, augmented by human cultural validation. We conduct the first systematic evaluation of three data acquisition strategies in terms of cultural relevance and feasibility: web crawling achieves approximately 85% cultural relevance at low cost, whereas generative methods exhibit significant limitations in modeling fine-grained cultural semantics. The resulting SEA-VL dataset comprises 1.28 million culturally relevant images spanning 11 SEA countries and over 10 languages, more than 50× larger than existing comparable VL datasets. SEA-VL is fully open-sourced to support multilingual, multicultural VL model training and evaluation.

📝 Abstract
Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further by exploring the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M culturally relevant SEA images, a collection more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.
Problem

Research questions and friction points this paper is trying to address.

Address the underrepresentation of Southeast Asia in vision-language research.
Develop a culturally relevant dataset for Southeast Asian languages.
Evaluate methods for collecting culturally accurate images.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Crowdsourcing for cultural relevance and diversity
Image crawling achieves ~85% cultural relevance
Generated images unreliable for SEA cultural nuances
Authors
Samuel Cahyawijaya
Cohere
Low-Resource NLP, Underrepresented Languages, Multilingual, Crosslingual, Zero/Few-shot learning
Holy Lovenia
SEACrowd
Multimodal & multilingual
Joel Ruben Antony Moniz
Mila - Quebec AI Institute, Polytechnique Montreal
Mohammad Rifqi Farhansyah
Researcher, Monash University
Natural Language Processing, Large Language Models, Low Resource Language
Thant Thiri Maung
Ton Duc Thang University
Frederikus Hudi
Nara Institute of Science and Technology
Machine Translation, Multilinguality, Low-Resource NLP
David Anugraha
Stanford University
Machine Learning, Natural Language Processing, Multimodality, Artificial Intelligence
Muhammad Ravi Shulthan Habibi
University of Indonesia, SEACrowd, IndoNLP
M. Qorib
National University of Singapore
Amit Agarwal
Oracle
Joseph Marvin Imperial
University of Bath, National University Philippines
H. Patel
Oracle
Vicky Feliren
Monash University, Indonesia
B. I. Nasution
The University of Manchester
Manuel Antonio Rufino
Samsung R&D Institute Philippines
Genta Indra Winata
Capital One AI Foundations
Multilinguality, Language Modeling, Multimodal, Low-resource NLP, Code-Switching
R. A. Rajagede
Universitas Islam Indonesia
Carlos Rafael Catalan
Samsung R&D Institute Philippines
Mohamed Fazli Imam
MBZUAI
Priyaranjan Pattnayak
Oracle Cloud Gen AI & University of Washington - Seattle
NLP, Machine Learning, Deep Learning, Generative AI
Salsabila Zahirah Pranida
MBZUAI
Kevin Pratama
Meta
Yeshil Bangera
University of New Haven
Machine Learning, Deep Learning, Data Engineering, Data Analytics
Adisai Na-Thalang
SCB 10X
Patricia Nicole Monderin
Samsung R&D Institute Philippines
Yueqi Song
BS/MS student, Carnegie Mellon University
AI Agents, Multimodal NLP, Multilingual NLP
Lynnette Hui Xian Ng
Societal Computing PhD Student at Carnegie Mellon University
Societal Computing, Computational Social Science, Social Network Analysis
Richardy Lobo' Sapan
University of Indonesia
Taki Hasan Rafi
Hanyang University
Trustworthy AI, AI Safety, Machine Learning, AI in Health
Bin Wang
Institute for Infocomm Research, Singapore
Supryadi
Tianjin University
Kanyakorn Veerakanjana
Faculty of Medicine Siriraj Hospital, Mahidol University
Piyalitt Ittichaiwong
Faculty of Medicine Siriraj Hospital, Mahidol University
Matthew Theodore Roque
Samsung R&D Institute Philippines
Karissa Vincentio
Binus University
Takdanai Kreangphet
Srinakharinwirot University
Phakphum Artkaew
New York University
K. Palgunadi
Institut Teknologi Sepuluh Nopember
Yanzhi Yu
Macau University of Science and Technology
Rochana Prih Hastuti
Universitas Gadjah Mada
William Nixon
Bandung Institute of Technology
Mithil Bangera
University of New Haven
A. X. W. Lim
National University of Singapore
A. Khine
King Mongkut’s University of Technology Thonburi
Hanif Muhammad Zhafran
Bandung Institute of Technology
Teddy Ferdinan
Wrocław Tech
Audra Aurora Izzani
University of Illinois, Urbana-Champaign
Ayushman Singh
SEACrowd, IndoNLP
Evan
Sony Group Corporation
Jauza Akbar Krito
Sony Group Corporation
Michael Anugraha
Sony Group Corporation
Fenal Ashokbhai Ilasariya
Sony Group Corporation
Haochen Li
Tsinghua University
cell-cell communication, single-cell genomics, spatial transcriptomics
John Amadeo Daniswara
Sony Group Corporation
Filbert Aurelian Tjiaranata
University of Indonesia
Eryawan Presma Yulianrifat
University of Indonesia
Can Udomcharoenchaikit
VISTEC
NLP
Fadil Risdian Ansori
Sony Group Corporation
Mahardika Krisna Ihsani
MBZUAI
Natural Language Processing, Machine Learning, Interpretability, Computational Linguistics
Giang Nguyen
Auburn University
A. Barik
National University of Singapore
Dan John Velasco
Samsung Research Philippines
Natural Language Processing, Deep Learning
Rifo Ahmad Genadi
MBZUAI
Chengwei Wei
Research Scientist, Institute for Infocomm Research, A*STAR
Natural Language Processing
Isaiah Flores
Ateneo de Manila University
Kenneth Ko Han Chen
Singapore Polytechnic
Anjela Gail Santos
University of the Philippines
Wan Shen Lim
Carnegie Mellon University
Databases, Database Systems
Kaung Si Phyo
Singapore Polytechnic
M. Dwiastuti
Universitas Gadjah Mada
Jiayun Luo
Sony Group Corporation
Jan Christian Blaise Cruz
MBZUAI, McGill University, Mila - Quebec AI Institute
Natural Language Processing, Translation, Multilinguality, Low-resource Languages, Code Switching
Ming Shan Hee
Singapore University of Technology and Design
multimodal, hate speech, memes
Ikhlasul Akmal Hanif
SEACrowd, IndoNLP
M. Alif Al Hakim
SEACrowd, IndoNLP
Muhammad Rizky Sya'ban
Bandung Institute of Technology
Kun Kerdthaisong
Thammasat University
Lester James V. Miranda
University of Cambridge
Natural Language Processing, Machine Learning
Fajri Koto
Assistant Professor (tenure-track), MBZUAI
Computational Linguistics, Natural Language Processing, Multilingual NLP, Human-centered NLP
Tirana Noor Fatyanosa
Brawijaya University
Alham Fikri Aji
MBZUAI, Monash Indonesia
Multilinguality, Low-resource NLP, Language Modeling, Machine Translation
Jostin Jerico Rosal
Seoul National University of Science and Technology
Jun Kevin
Universitas Pelita Harapan
Robert Wijaya
PhD Student, Singapore University of Technology and Design
Computer Vision, Vision-Language, AI
Onno P. Kampman
University of Cambridge, MOHT
natural language processing, digital mental health, cognitive neuroscience, machine learning
Ruochen Zhang
Brown University
Multilingual NLP, Interpretability, Code-Switching
Borje F. Karlsson
Beijing Academy of Artificial Intelligence (BAAI)
Peerat Limkonchotiwat
Research Fellow, AI Singapore, National University of Singapore
Evaluation and Benchmark, Representation Learning, Large Language Model, Multilingual Learning