SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Date: 2024-06-14
Venue: Conference on Empirical Methods in Natural Language Processing
Citations: 6
Influential citations: 0
AI Summary
Problem: Southeast Asian (SEA) languages are severely underrepresented in multimodal AI datasets, and the dominance of English leads to evaluation bias and cultural misrepresentation. Method: We introduce the first standardized multimodal (text/image/audio) dataset and benchmark covering nearly 1,000 SEA languages, enabling fair evaluation across 36 indigenous languages on 13 NLP, CV, and ASR tasks. We propose a holistic language collaboration paradigm, a cross-modal and cross-lingual unified annotation framework, and a culturally adaptive, technically equitable resource allocation mechanism. Leveraging multi-source crowdsourcing, linguistics-driven data cleaning, cross-modal alignment, and zero-shot transfer evaluation, we curate and release over 50 high-quality datasets. Results: Our benchmark yields an average 27.3% performance gain for mainstream models; it has driven SEA-language adaptation of 12 open-source models, all officially integrated into Hugging Face.

๐Ÿ“ Abstract
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, through a collaborative movement, we introduce SEACrowd, a comprehensive resource center that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in Southeast Asia.
Problem

Research questions and friction points this paper is trying to address.

Addresses lack of AI datasets for Southeast Asian languages.
Evaluates AI models on 36 indigenous languages across 13 tasks.
Proposes strategies for AI advancements and resource equity in SEA.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual multimodal data hub
Standardized corpora for SEA languages
Benchmark suite for AI evaluation
Authors

Holy Lovenia; SEACrowd; interests: Multimodal & multilingual
Rahmad Mahendra; Universitas Indonesia and RMIT University; interests: Natural Language Processing, Information Extraction, Text Mining, Recommender System
Salsabil Maulana Akbar; IndoNLP
Lester James V. Miranda; University of Cambridge; interests: Natural Language Processing, Machine Learning
Jennifer Santoso; RevComm, Inc.
Elyanah Aco; Independent Researcher
Akhdan Fadhilah; Tohoku University
Jonibek Mansurov; PhD student in NLP, MBZUAI; interests: NLP
Joseph Marvin Imperial; University of Bath, National University Philippines
Onno P. Kampman; University of Cambridge, MOHT; interests: natural language processing, digital mental health, cognitive neuroscience, machine learning
Joel Ruben Antony Moniz; Independent Researcher
Muhammad Ravi Shulthan Habibi; Universitas Indonesia, IndoNLP
Frederikus Hudi; Nara Institute of Science and Technology; interests: Machine Translation, Multilinguality, Low-Resource NLP
Railey Montalan; Independent Researcher
Ryan Ignatius; Independent Researcher
Joanito Agili Lopo
William Nixon; Institut Teknologi Bandung
Börje F. Karlsson; Beijing Academy of Artificial Intelligence (BAAI); interests: Machine Learning Systems, Intelligent Agents, Knowledge Mining, Mobile Computing, Multilinguality
James Jaya; Independent Researcher
Ryandito Diandaru; Master's Student, MBZUAI; interests: NLP
Yuze Gao; Independent Researcher
Patrick Amadeus; Institut Teknologi Bandung
Bin Wang; Independent Researcher
Jan Christian Blaise Cruz; MBZUAI, McGill University, Mila - Quebec AI Institute; interests: Natural Language Processing, Translation, Multilinguality, Low-resource Languages, Code Switching
Chenxi Whitehouse; Research Scientist at Meta; interests: Natural Language Processing
Ivan Halim Parmonangan; Queensland University of Technology
Maria Khelli; Institut Teknologi Bandung
Wenyu Zhang; Independent Researcher
Lucky Susanto; Monash University Indonesia; interests: Natural Language Processing, Machine Learning, Neural Machine Translation, Low Resource Settings
Reynard Adha Ryanda
Sonny Lazuardi Hermawan; Independent Design Engineer
Dan John Velasco; Samsung Research Philippines; interests: Natural Language Processing, Deep Learning
Muhammad Dehan Al Kautsar; Mohamed bin Zayed University of Artificial Intelligence; interests: Natural Language Processing, Multilinguality, Human-Centered NLP
Willy Fitra Hendria; Independent Researcher
Y. Moslem
Noah Flynn; Amazon
Muhammad Farid Adilazuarda; MBZUAI
Haochen Li; Tsinghua University; interests: cell-cell communication, single-cell genomics, spatial transcriptomics
Johanes Lee; Institut Teknologi Bandung
R. Damanhuri; Universitas Diponegoro
Shuo Sun; Johns Hopkins University
M. Qorib; NUS
Amirbek Djanibekov; PhD Student MBZUAI; interests: Natural Language Processing, Speech Processing
Wei Qi Leong; AI Singapore
Quyet V. Do; HKUST
Niklas Muennighoff; Stanford University; interests: large language models, artificial intelligence, machine learning
T. Pansuwan; University of Cambridge
Ilham Firdausi Putra; Independent Researcher
Yan Xu; Huawei Noah's Ark Lab, HKUST
Ngee Chia Tai; AI Singapore
Ayu Purwarianti; Associate Professor, Informatics, Institut Teknologi Bandung, Indonesia; interests: Computational Linguistics, Machine Learning
Sebastian Ruder; Research Scientist, Meta; interests: Natural Language Processing, Machine Learning, Deep Learning, Artificial Intelligence
William-Chandra Tjhi; AI Singapore
Peerat Limkonchotiwat; Research Fellow, AI Singapore, National University of Singapore; interests: Evaluation and Benchmark, Representation Learning, Large Language Model, Multilingual Learning
Alham Fikri Aji; MBZUAI, Monash Indonesia; interests: Multilinguality, Low-resource NLP, Language Modeling, Machine Translation
Sedrick Scott Keh; Independent Researcher
Genta Indra Winata; Capital One AI Foundations; interests: Multilinguality, Language Modeling, Multimodal, Low-resource NLP, Code-Switching
Ruochen Zhang; Brown University; interests: Multilingual NLP, Interpretability, Code-Switching
Fajri Koto; Assistant Professor (tenure-track), MBZUAI; interests: Computational Linguistics, Natural Language Processing, Multilingual NLP, Human-centered NLP
Zheng-Xin Yong; Brown University; interests: Machine Learning
Samuel Cahyawijaya; Cohere; interests: Low-Resource NLP, Underrepresented Languages, Multilingual, Crosslingual, Zero/Few-shot learning