🤖 AI Summary
Large language models (LLMs) perform far worse on China's minority languages (Tibetan, Uyghur, Kazakh, and Mongolian) than on high-resource languages, owing to their low-resource status, their diverse writing systems (Tibetan script; Arabic-based scripts for Uyghur and Kazakh; Cyrillic and traditional Mongolian scripts for Mongolian), and their rich morphology. Method: This paper introduces MiLiC-Eval, the first systematic multi-task evaluation benchmark tailored to these languages. It comprises nine language understanding and reasoning tasks with 24K human-verified instances, enabling unified, fine-grained assessment across all four languages and their non-Latin scripts, with particular emphasis on grammatical sensitivity and cross-script capability. Results: Experiments show that state-of-the-art multilingual LLMs average below 40% accuracy on grammar-intensive and multi-script tasks. MiLiC-Eval provides standardized prompt templates, Unicode-compatible automated scoring, and reproducible diagnostic tools, establishing a foundational benchmark for adapting and improving LLMs on low-resource languages.
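To make "Unicode-compatible automated scoring" concrete: for scripts like Tibetan and Arabic-based Uyghur/Kazakh, the same syllable can be encoded as different codepoint sequences, so scoring must normalize before comparing. The sketch below is illustrative only, not MiLiC-Eval's actual API; `normalize` and `exact_match_accuracy` are hypothetical helper names.

```python
# Hedged sketch of Unicode-aware exact-match scoring for non-Latin scripts.
# Helper names are illustrative, not the benchmark's actual interface.
import unicodedata

def normalize(text: str) -> str:
    # Canonical (NFC) normalization maps canonically equivalent strings to
    # one form; essential for Tibetan and Arabic-based scripts, where
    # visually identical text can have different codepoint sequences.
    return unicodedata.normalize("NFC", text).strip()

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    assert len(predictions) == len(references)
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

# Example: two canonically equivalent encodings of the same Tibetan syllable.
preds = ["\u0F40\u0F73"]        # KA + precomposed vowel sign II
refs = ["\u0F40\u0F71\u0F72"]   # KA + vowel signs AA + I (decomposed)
print(exact_match_accuracy(preds, refs))  # 1.0 after normalization
```

Without normalization, the two strings above differ at the codepoint level despite rendering identically, so a naive byte-wise exact match would wrongly score a correct answer as an error.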
📝 Abstract
Large language models (LLMs) excel in high-resource languages but struggle with low-resource languages (LRLs), particularly those spoken by minority communities in China, such as Tibetan, Uyghur, Kazakh, and Mongolian. To systematically track the progress in these languages, we introduce MiLiC-Eval, a benchmark designed for minority languages in China, featuring 24K instances across 9 tasks. MiLiC-Eval focuses on underrepresented writing systems and provides a fine-grained assessment of linguistic and problem-solving skills. Our evaluation reveals that LLMs perform poorly on syntax-intensive tasks and multi-script languages. We further demonstrate how MiLiC-Eval can help advance LRL research in handling diverse writing systems and understanding the process of language adaptation.