🤖 AI Summary
Existing evaluation benchmarks inadequately assess large language models' (LLMs) capabilities in authentic multilingual Southeast Asian (SEA) contexts because they rely heavily on English-centric or machine-translated data. Method: We introduce two localized, human-curated benchmarks: SeaExam, comprising real regional educational examination items spanning history, literature, and other disciplines; and SeaBench, built upon multi-turn, open-ended community dialogues. Both employ natively authored multilingual questions (not English translations), combined with scenario-driven data collection, structured item annotation, explicit modeling of conversational turn-taking (see the sketch below), and a cross-lingual evaluation framework. Contribution/Results: Experiments demonstrate that SeaExam and SeaBench offer greater sensitivity and discriminative power than mainstream translation-based benchmarks for evaluating LLMs' genuine SEA language proficiency. They establish a new paradigm for multilingual LLM evaluation grounded in local linguistic authenticity, pedagogical relevance, and sociolinguistic realism.
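To make the "explicit modeling of conversational turn-taking" concrete, here is a minimal Python sketch of a multi-turn, rubric-judged evaluation loop. The `Turn`/`DialogueItem` schema and the `model`/`judge` callables are hypothetical stand-ins for illustration only, not the released SeaBench format or the paper's actual harness.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One exchange in an open-ended dialogue: a user prompt plus a scoring rubric."""
    user: str
    rubric: str  # criteria a judge applies to the model's reply (hypothetical field)

@dataclass
class DialogueItem:
    """A SeaBench-style multi-turn item (illustrative schema, not the released format)."""
    language: str               # e.g. "th", "id", "vi"
    scenario: str               # real-world context the dialogue was collected from
    turns: list[Turn] = field(default_factory=list)

def run_dialogue(item: DialogueItem, model, judge) -> float:
    """Feed turns to the model in order, carrying full history, and average judge scores.

    `model` maps a chat-style message list to a reply string; `judge` maps a
    (reply, rubric) pair to a numeric score. Both are assumed interfaces.
    """
    history: list[dict] = []
    scores: list[float] = []
    for turn in item.turns:
        history.append({"role": "user", "content": turn.user})
        reply = model(history)                    # model sees the conversation so far
        history.append({"role": "assistant", "content": reply})
        scores.append(judge(reply, turn.rubric))  # rubric-based judging, e.g. 1-10
    return sum(scores) / len(scores)
```

Carrying the accumulated history into each model call is what separates a genuine multi-turn evaluation from scoring each question in isolation, which is the behavior the turn-taking modeling is meant to capture.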
📝 Abstract
This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed from real-world scenarios in SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench more effectively discern LLM performance on SEA language tasks than their translated counterparts. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.
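One simple way to read the claim that the native benchmarks "more effectively discern LLM performance" is as a wider spread of scores across models: a benchmark that separates strong from weak models has more discriminative power. The sketch below quantifies that with a population standard deviation; the model names and scores are invented purely for illustration and are not results from the paper, which may use different statistics.

```python
from statistics import pstdev

def discriminative_power(scores_by_model: dict[str, float]) -> float:
    """Spread of model scores on a benchmark: a wider spread separates models better."""
    return pstdev(scores_by_model.values())

# Hypothetical accuracy numbers for three placeholder models, NOT results from the paper.
native_benchmark = {"model_a": 71.2, "model_b": 55.4, "model_c": 48.9}
translated_benchmark = {"model_a": 68.0, "model_b": 63.5, "model_c": 61.1}

print(f"native benchmark spread:     {discriminative_power(native_benchmark):.1f}")
print(f"translated benchmark spread: {discriminative_power(translated_benchmark):.1f}")
```

Under these made-up numbers the native benchmark spreads the same three models further apart, which is the pattern the abstract attributes to SeaExam and SeaBench relative to translated datasets.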