KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Kazakh, a low-resource language, remains underrepresented in mainstream multilingual large language models (LLMs) and lacks a localized, discipline-diverse, authoritative evaluation benchmark. Method: The authors introduce KazMMLU, the first MMLU-style benchmark for Kazakh, comprising 23,000 bilingual (Kazakh and Russian) multiple-choice questions validated by native speakers and in-service teachers and sourced from authentic pedagogical materials grounded in Kazakhstan's local knowledge. KazMMLU is the first benchmark to systematically reflect the country's bilingual education system and regional cultural context, addressing a gap in low-resource language evaluation. Results: Evaluation of state-of-the-art models, including Llama-3.1, Qwen-2.5, GPT-4, and DeepSeek-V3, shows that even the best-performing models achieve substantially lower accuracy on Kazakh than on English, underscoring the need for Kazakh-specific LLM development and positioning KazMMLU as a foundational resource for future research and localization efforts.

📝 Abstract
Despite having a population of twenty million, Kazakhstan's culture and language remain underrepresented in the field of natural language processing. Although large language models (LLMs) continue to advance worldwide, progress for the Kazakh language has been limited, as seen in the scarcity of dedicated models and benchmark evaluations. To address this gap, we introduce KazMMLU, the first MMLU-style dataset specifically designed for the Kazakh language. KazMMLU comprises 23,000 questions that cover various educational levels, including STEM, humanities, and social sciences, sourced from authentic educational materials and manually validated by native speakers and educators. The dataset includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting Kazakhstan's bilingual education system and rich local context. Our evaluation of several state-of-the-art multilingual models (Llama-3.1, Qwen-2.5, GPT-4, and DeepSeek V3) demonstrates substantial room for improvement, as even the best-performing models struggle to achieve competitive performance in Kazakh and Russian. These findings underscore significant performance gaps compared to high-resource languages. We hope that our dataset will enable further research and development of Kazakh-centric LLMs. Data and code will be made available upon acceptance.
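The evaluation setup described in the abstract, multiple-choice questions scored by exact-match accuracy against a gold answer key, can be sketched as below. The question texts, field names, and schema here are illustrative placeholders, not the actual KazMMLU data format, and the predictor is a stand-in for an LLM call.

```python
# Minimal sketch of MMLU-style multiple-choice evaluation.
# Field names and example questions are hypothetical, not the
# released KazMMLU schema.

def accuracy(questions, predict):
    """Fraction of questions where the model's letter choice matches the gold answer."""
    correct = sum(1 for q in questions if predict(q) == q["answer"])
    return correct / len(questions)

# Toy bilingual examples in the four-option (A-D) MMLU format.
sample_questions = [
    {
        "language": "kk",
        "subject": "geography",
        "question": "Қазақстанның астанасы қай қала?",  # "Which city is Kazakhstan's capital?"
        "choices": {"A": "Almaty", "B": "Astana", "C": "Shymkent", "D": "Aqtobe"},
        "answer": "B",
    },
    {
        "language": "ru",
        "subject": "chemistry",
        "question": "Какой элемент обозначается символом Fe?",  # "Which element has the symbol Fe?"
        "choices": {"A": "Железо", "B": "Фтор", "C": "Фосфор", "D": "Гелий"},
        "answer": "A",
    },
]

def dummy_predict(q):
    # Stand-in for an LLM call that returns one of "A"-"D".
    return "B"

print(accuracy(sample_questions, dummy_predict))  # → 0.5 (first correct, second not)
```

A real evaluation would replace `dummy_predict` with a prompted model call and report per-language and per-subject accuracy separately, which is what makes the Kazakh-vs-English gap in the paper's results visible.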
Problem

Research questions and friction points this paper is trying to address.

Underrepresentation of the Kazakh language in NLP
Lack of dedicated Kazakh language models
Performance gaps in multilingual models for Kazakh and Russian
Innovation

Methods, ideas, or system contributions that make the work stand out.

KazMMLU dataset creation
Multilingual model evaluation
Bilingual educational context