SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation benchmarks inadequately assess the capabilities of large language models (LLMs) in authentic multilingual Southeast Asian (SEA) contexts because they over-rely on English-centric or machine-translated data. Method: We introduce two localized, human-curated benchmarks: SeaExam, comprising real regional educational examination items spanning history, literature, and other disciplines; and SeaBench, built on multi-turn, open-ended community dialogues. Both use natively authored multilingual questions (not English translations), combining scenario-driven data collection, structured item annotation, explicit modeling of conversational turn-taking, and a cross-lingual evaluation framework. Contribution/Results: Experiments demonstrate that SeaExam and SeaBench significantly outperform mainstream translation-based benchmarks in sensitivity and discriminative power when evaluating LLMs' genuine SEA language proficiency. They establish a new paradigm for multilingual LLM evaluation grounded in local linguistic authenticity, pedagogical relevance, and sociolinguistic realism.

📝 Abstract
This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets, which are primarily derived from English translations, these benchmarks are constructed from real-world scenarios in SEA regions. SeaExam draws on regional educational exams to form a comprehensive dataset that covers subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench discern differences in LLM performance on SEA language tasks more effectively than their translated counterparts. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in Southeast Asian scenarios
Assessing multilingual capabilities with real-world queries
Benchmarking local history and literature knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Local multilingual benchmark creation
Real-world SEA scenario integration
Effective LLM performance evaluation
Chaoqun Liu
Nanyang Technological University, Singapore
Multilingual LLM, Multi-modal LLM, Low-resource NLP, LLM Evaluation
Wenxuan Zhang
DAMO Academy, Alibaba Group, Singapore; Hupan Lab, Hangzhou, China
Jiahao Ying
Singapore Management University
Mahani Aljunied
DAMO Academy, Alibaba Group, Singapore; Hupan Lab, Hangzhou, China
Anh Tuan Luu
Nanyang Technological University, Singapore
Lidong Bing
MiroMind, Alibaba DAMO, Tencent, CMU, CUHK
Natural Language Processing, Large Language Models, Large Multimodal Models