Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

📅 2025-03-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Southeast Asia (SEA) remains severely underrepresented in vision-language (VL) research, which hinders AI models' ability to capture its cultural and linguistic diversity. To address this, we introduce SEA-VL, the first open-source VL dataset and systematic construction framework tailored to SEA. Our methodology combines localized crowdsourcing, culturally aware web crawling, and generative image exploration via Stable Diffusion, backed by human cultural validation. We compare the three data acquisition strategies on cultural relevance and feasibility: web crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas generative methods remain unreliable at reflecting fine-grained cultural nuances. The resulting SEA-VL dataset comprises 1.28 million culturally relevant images spanning 11 SEA countries and over 10 languages, more than 50 times larger than other existing datasets. SEA-VL is fully open-sourced to support multilingual, multicultural VL model training and evaluation.

📝 Abstract

Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further by exploring the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M culturally relevant SEA images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.
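
To make a figure such as "approximately 85% cultural relevance" concrete, the sketch below shows one plausible way to turn human validation votes into a relevance rate. It is an illustration only: the abstract does not describe the annotation protocol, so the majority-vote rule, the tie handling, and the names `majority_relevant`, `relevance_rate`, and `sample_annotations` are all assumptions rather than the authors' method.

```python
from typing import Dict, List


def majority_relevant(votes: List[bool]) -> bool:
    """Return True only if a strict majority of annotators marked the image relevant.

    Assumption: ties count as not relevant (a conservative choice).
    """
    return sum(votes) * 2 > len(votes)


def relevance_rate(annotations: Dict[str, List[bool]]) -> float:
    """Fraction of images judged culturally relevant by majority vote."""
    if not annotations:
        raise ValueError("annotations must contain at least one image")
    relevant = sum(majority_relevant(v) for v in annotations.values())
    return relevant / len(annotations)


# Hypothetical votes for four crawled images (True = culturally relevant).
sample_annotations = {
    "img_001": [True, True, False],
    "img_002": [True, True, True],
    "img_003": [False, False, True],
    "img_004": [True, True],
}

rate = relevance_rate(sample_annotations)
print(f"cultural relevance: {rate:.0%}")  # prints "cultural relevance: 75%"
```

Counting ties as not relevant keeps the estimate conservative; a different tie rule or a per-annotator agreement threshold would shift the reported rate.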

Problem

Research questions and friction points this paper is trying to address.

Address underrepresentation of Southeast Asia in vision-language research.

Develop a culturally relevant dataset for Southeast Asian languages.

Evaluate methods for collecting culturally accurate images.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Crowdsourcing for cultural relevance and diversity

Image crawling achieves ~85% cultural relevance

Generated images unreliable for SEA cultural nuances
