BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models

📅 2026-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the absence of a comprehensive benchmark for systematically evaluating large language models (LLMs) in Burmese across core capabilities: understanding, reasoning, and generation. The authors present the first multidimensional evaluation suite for Burmese, encompassing seven linguistically and culturally grounded subtasks developed through a native-speaker-driven annotation process that ensures linguistic naturalness and cultural authenticity, filling critical gaps in several subtask domains. Large-scale evaluations of both open-source and commercial LLMs reveal that model architecture, language representation, and instruction tuning have a substantially greater impact on performance in low-resource settings than model scale alone. Notably, regional fine-tuning targeting Southeast Asian languages and the adoption of next-generation models markedly enhance Burmese capabilities. Results are released via a public leaderboard to foster sustained research on low-resource languages.

📝 Abstract
We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies—Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, and Machine Translation—several of which were previously unavailable for Burmese. The benchmark is constructed through a rigorous native-speaker-driven process to ensure linguistic naturalness, fluency, and cultural authenticity while minimizing translation-induced artifacts. We conduct a large-scale evaluation of both open-weight and commercial LLMs to examine challenges in Burmese modeling arising from limited pretraining coverage, rich morphology, and syntactic variation. Our results show that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone. In particular, Southeast Asian regional fine-tuning and newer model generations yield substantial gains. Finally, we release BURMESE-SAN as a public leaderboard to support systematic evaluation and sustained progress in Burmese and other low-resource languages.
https://leaderboard.sea-lion.ai/detailed/MY
Problem

Research questions and friction points this paper is trying to address.

Burmese NLP
large language models
evaluation benchmark
low-resource languages
NLU/NLR/NLG
Innovation

Methods, ideas, or system contributions that make the work stand out.

Burmese NLP
large language models
low-resource languages
multitask benchmark
native-speaker curation
Thura Aung
King Mongkut’s Institute of Technology Ladkrabang
Jann Railey Montalan
AI Singapore, National University of Singapore
Jian Gang Ngui
AI Singapore, National University of Singapore
Peerat Limkonchotiwat
Research Fellow, AI Singapore, National University of Singapore
Evaluation and Benchmark
Representation Learning
Large Language Model
Multilingual Learning