SEA-BED: Southeast Asia Embedding Benchmark

📅 2025-08-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Southeast Asia lacks a sentence embedding benchmark tailored to its languages' linguistic characteristics. Method: We introduce SEA-BED, the first large-scale regional benchmark, covering 10 Southeast Asian languages, 9 task categories, and 169 datasets, 71% of which are human-formulated, substantially improving evaluation reliability for low-resource languages. Unlike existing multilingual benchmarks (e.g., MTEB) that rely heavily on machine translation, SEA-BED prioritizes native semantic properties and high-quality human annotation. We systematically evaluate 17 state-of-the-art embedding models in a multitask setting, including semantic textual similarity, retrieval, and classification, on both original and translated texts. Contribution/Results: The evaluation reveals substantial cross-lingual performance inconsistency and volatile model rankings across languages, and it empirically demonstrates the importance of human annotation for evaluating low-resource languages such as Burmese. The work provides the first large-scale empirical validation of the necessity and efficacy of region-specific embedding benchmarks.
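For context on the evaluation protocol: STS-style tasks are typically scored by embedding each sentence pair, computing cosine similarity, and correlating the similarities with human judgments via Spearman's rank correlation. A minimal sketch of that loop follows; the model name, sentence pairs, and gold scores are illustrative placeholders, not SEA-BED's actual harness or data.

```python
# Minimal sketch of STS-style evaluation: cosine similarity of sentence
# embeddings vs. human gold scores, summarized by Spearman correlation.
# Model name, pairs, and gold scores are illustrative placeholders.
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

def sts_spearman(model_name: str, pairs: list[tuple[str, str]], gold: list[float]) -> float:
    model = SentenceTransformer(model_name)
    left = model.encode([a for a, _ in pairs], normalize_embeddings=True)
    right = model.encode([b for _, b in pairs], normalize_embeddings=True)
    cosine = np.sum(left * right, axis=1)  # row-wise cosine (unit vectors)
    corr, _ = spearmanr(cosine, gold)
    return corr

# Toy pairs: a near-paraphrase, an unrelated pair, and a partial match.
pairs = [
    ("The cat sleeps on the mat.", "A cat is sleeping on a mat."),
    ("He went to the market.", "The weather is sunny today."),
    ("She plays the piano well.", "She is a musician."),
]
gold = [4.8, 0.3, 3.0]  # invented 0-5 similarity judgments
print(sts_spearman("intfloat/multilingual-e5-base", pairs, gold))
```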

📝 Abstract
Sentence embeddings are essential for NLP tasks such as semantic search, re-ranking, and textual similarity. Although multilingual benchmarks like MMTEB broaden coverage, Southeast Asia (SEA) datasets are scarce and often machine-translated, missing native linguistic properties. With nearly 700 million speakers, the SEA region lacks a region-specific embedding benchmark. We introduce SEA-BED, the first large-scale SEA embedding benchmark with 169 datasets across 9 tasks and 10 languages, where 71% are formulated by humans, not machine generation or translation. We address three research questions: (1) which SEA languages and tasks are challenging, (2) whether SEA languages show unique performance gaps globally, and (3) how human vs. machine translations affect evaluation. We evaluate 17 embedding models across six studies, analyzing task and language challenges, cross-benchmark comparisons, and translation trade-offs. Results show sharp ranking shifts, inconsistent model performance among SEA languages, and the importance of human-curated datasets for low-resource languages like Burmese.
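
To make "sharp ranking shifts" concrete: rank the same models per language, then measure rank agreement across languages, for example with Spearman's rho. The sketch below uses invented scores for hypothetical models; none of the numbers come from the paper.

```python
# Hedged sketch: quantifying cross-language ranking shifts with Spearman's rho.
# All scores are invented for illustration; they are not SEA-BED results.
from scipy.stats import spearmanr

scores = {  # language -> {model: mean benchmark score}
    "tha": {"model_a": 0.71, "model_b": 0.65, "model_c": 0.80},  # Thai
    "mya": {"model_a": 0.42, "model_b": 0.55, "model_c": 0.38},  # Burmese
}

models = sorted(scores["tha"])
tha_scores = [scores["tha"][m] for m in models]
mya_scores = [scores["mya"][m] for m in models]
rho, _ = spearmanr(tha_scores, mya_scores)
print(f"Rank agreement (tha vs. mya): {rho:.2f}")  # low rho => rankings shift
```

A low or negative rho between two languages signals exactly the kind of ranking instability the paper reports.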
Problem

Research questions and friction points this paper is trying to address.

Lack of a Southeast Asia-specific embedding benchmark for NLP tasks
Scarcity of native SEA datasets; existing multilingual benchmarks rely heavily on machine translation
Need to assess performance gaps across SEA languages and the impact of human vs. machine translation on evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

First large-scale SEA embedding benchmark (169 datasets, 9 tasks, 10 languages)
71% of datasets formulated by humans, not machine translation or generation
Evaluation of 17 embedding models across six studies