🤖 AI Summary
Existing evaluation benchmarks inadequately assess large language models' (LLMs) capabilities in authentic multilingual Southeast Asian (SEA) contexts because they rely heavily on English-centric or machine-translated data. Method: We introduce two localized, human-curated benchmarks: SeaExam, comprising real regional educational examination items spanning history, literature, and other disciplines; and SeaBench, built upon multi-turn, open-ended community dialogues. Both employ natively authored multilingual questions (not English translations), combined with scenario-driven data collection, structured item annotation, explicit modeling of conversational turn-taking (see the sketch below), and a cross-lingual evaluation framework. Contribution/Results: Experiments demonstrate that SeaExam and SeaBench offer greater sensitivity and discriminative power than mainstream translation-based benchmarks for evaluating LLMs' genuine SEA language proficiency. They establish a new paradigm for multilingual LLM evaluation grounded in local linguistic authenticity, pedagogical relevance, and sociolinguistic realism.
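To make the "explicit modeling of conversational turn-taking" concrete, here is a minimal Python sketch of a multi-turn, rubric-judged evaluation loop. The `Turn`/`DialogueItem` schema and the `model`/`judge` callables are hypothetical stand-ins for illustration only, not the released SeaBench format or the paper's actual harness.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One exchange in an open-ended dialogue: a user prompt plus a scoring rubric."""
    user: str
    rubric: str  # criteria a judge applies to the model's reply (hypothetical field)

@dataclass
class DialogueItem:
    """A SeaBench-style multi-turn item (illustrative schema, not the released format)."""
    language: str               # e.g. "th", "id", "vi"
    scenario: str               # real-world context the dialogue was collected from
    turns: list[Turn] = field(default_factory=list)

def run_dialogue(item: DialogueItem, model, judge) -> float:
    """Feed turns to the model in order, carrying full history, and average judge scores.

    `model` maps a chat-style message list to a reply string; `judge` maps a
    (reply, rubric) pair to a numeric score. Both are assumed interfaces.
    """
    history: list[dict] = []
    scores: list[float] = []
    for turn in item.turns:
        history.append({"role": "user", "content": turn.user})
        reply = model(history)                    # model sees the conversation so far
        history.append({"role": "assistant", "content": reply})
        scores.append(judge(reply, turn.rubric))  # rubric-based judging, e.g. 1-10
    return sum(scores) / len(scores)
```

Carrying the accumulated history into each model call is what separates a genuine multi-turn evaluation from scoring each question in isolation, which is the behavior the turn-taking modeling is meant to capture.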
📝 Abstract
This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed from real-world scenarios in SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench more effectively discern LLM performance on SEA language tasks than their translated counterparts. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.
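One simple way to read the claim that the native benchmarks "more effectively discern LLM performance" is as a wider spread of scores across models: a benchmark that separates strong from weak models has more discriminative power. The sketch below quantifies that with a population standard deviation; the model names and scores are invented purely for illustration and are not results from the paper, which may use different statistics.

```python
from statistics import pstdev

def discriminative_power(scores_by_model: dict[str, float]) -> float:
    """Spread of model scores on a benchmark: a wider spread separates models better."""
    return pstdev(scores_by_model.values())

# Hypothetical accuracy numbers for three placeholder models, NOT results from the paper.
native_benchmark = {"model_a": 71.2, "model_b": 55.4, "model_c": 48.9}
translated_benchmark = {"model_a": 68.0, "model_b": 63.5, "model_c": 61.1}

print(f"native benchmark spread:     {discriminative_power(native_benchmark):.1f}")
print(f"translated benchmark spread: {discriminative_power(translated_benchmark):.1f}")
```

Under these made-up numbers the native benchmark spreads the same three models further apart, which is the pattern the abstract attributes to SeaExam and SeaBench relative to translated datasets.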