SEA-BED: Southeast Asia Embedding Benchmark

📅 2025-08-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Southeast Asia lacks a sentence embedding benchmark tailored to its languages' linguistic characteristics. Method: We introduce SEA-BED, the first large-scale regional benchmark, covering 10 Southeast Asian languages, 9 task categories, and 169 datasets, 71% of which are human-formulated, substantially improving evaluation reliability for low-resource languages. Unlike existing multilingual benchmarks (e.g., MTEB) that rely heavily on machine translation, SEA-BED prioritizes native semantic properties and high-quality human annotation. We systematically evaluate 17 state-of-the-art embedding models in a multitask setting, including semantic textual similarity, retrieval, and classification, on both original and translated texts. Contribution/Results: The evaluation reveals substantial cross-lingual performance inconsistency and volatile model rankings across languages, and it empirically demonstrates the importance of human annotation for evaluating low-resource languages such as Burmese. The work provides the first large-scale empirical validation of the necessity and efficacy of region-specific embedding benchmarks.
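For context on the evaluation protocol: STS-style tasks are typically scored by embedding each sentence pair, computing cosine similarity, and correlating the similarities with human judgments via Spearman's rank correlation. A minimal sketch of that loop follows; the model name, sentence pairs, and gold scores are illustrative placeholders, not SEA-BED's actual harness or data.

```python
# Minimal sketch of STS-style evaluation: cosine similarity of sentence
# embeddings vs. human gold scores, summarized by Spearman correlation.
# Model name, pairs, and gold scores are illustrative placeholders.
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

def sts_spearman(model_name: str, pairs: list[tuple[str, str]], gold: list[float]) -> float:
    model = SentenceTransformer(model_name)
    left = model.encode([a for a, _ in pairs], normalize_embeddings=True)
    right = model.encode([b for _, b in pairs], normalize_embeddings=True)
    cosine = np.sum(left * right, axis=1)  # row-wise cosine (unit vectors)
    corr, _ = spearmanr(cosine, gold)
    return corr

# Toy pairs: a near-paraphrase, an unrelated pair, and a partial match.
pairs = [
    ("The cat sleeps on the mat.", "A cat is sleeping on a mat."),
    ("He went to the market.", "The weather is sunny today."),
    ("She plays the piano well.", "She is a musician."),
]
gold = [4.8, 0.3, 3.0]  # invented 0-5 similarity judgments
print(sts_spearman("intfloat/multilingual-e5-base", pairs, gold))
```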

📝 Abstract
Sentence embeddings are essential for NLP tasks such as semantic search, re-ranking, and textual similarity. Although multilingual benchmarks like MMTEB broaden coverage, Southeast Asia (SEA) datasets are scarce and often machine-translated, missing native linguistic properties. With nearly 700 million speakers, the SEA region lacks a region-specific embedding benchmark. We introduce SEA-BED, the first large-scale SEA embedding benchmark with 169 datasets across 9 tasks and 10 languages, where 71% are formulated by humans, not machine generation or translation. We address three research questions: (1) which SEA languages and tasks are challenging, (2) whether SEA languages show unique performance gaps globally, and (3) how human vs. machine translations affect evaluation. We evaluate 17 embedding models across six studies, analyzing task and language challenges, cross-benchmark comparisons, and translation trade-offs. Results show sharp ranking shifts, inconsistent model performance among SEA languages, and the importance of human-curated datasets for low-resource languages like Burmese.
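
To make "sharp ranking shifts" concrete: rank the same models per language, then measure rank agreement across languages, for example with Spearman's rho. The sketch below uses invented scores for hypothetical models; none of the numbers come from the paper.

```python
# Hedged sketch: quantifying cross-language ranking shifts with Spearman's rho.
# All scores are invented for illustration; they are not SEA-BED results.
from scipy.stats import spearmanr

scores = {  # language -> {model: mean benchmark score}
    "tha": {"model_a": 0.71, "model_b": 0.65, "model_c": 0.80},  # Thai
    "mya": {"model_a": 0.42, "model_b": 0.55, "model_c": 0.38},  # Burmese
}

models = sorted(scores["tha"])
tha_scores = [scores["tha"][m] for m in models]
mya_scores = [scores["mya"][m] for m in models]
rho, _ = spearmanr(tha_scores, mya_scores)
print(f"Rank agreement (tha vs. mya): {rho:.2f}")  # low rho => rankings shift
```

A low or negative rho between two languages signals exactly the kind of ranking instability the paper reports.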
Problem

Research questions and friction points this paper is trying to address.

Lack of a Southeast Asia-specific embedding benchmark for NLP tasks
Scarcity of native SEA datasets; existing multilingual benchmarks rely heavily on machine translation
Need to assess performance gaps across SEA languages and the impact of human vs. machine translation on evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

First large-scale SEA embedding benchmark (169 datasets, 9 tasks, 10 languages)
71% of datasets formulated by humans, not machine translation or generation
Evaluation of 17 embedding models across six studies