MILU: A Multi-task Indic Language Understanding Benchmark

📅 2024-11-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of evaluation benchmarks for large language models (LLMs) in low-resource, non-Latin-script Indian languages, this work introduces MILU, a multi-task, multi-domain understanding benchmark covering 11 Indic languages, 8 broad domains, and 41 fine-grained subjects. It integrates culturally specific knowledge (e.g., local history, law, arts, festivals) with standard subjects such as science and mathematics, drawing on regional and state-level examination material and enabling cross-lingual and domain-wise performance analysis. Across 42 evaluated models, GPT-4o achieves the highest average accuracy (74%); open multilingual models outperform language-specific fine-tuned variants, which perform only slightly above random; performance is higher in high-resource than in low-resource languages; and models are weakest in culturally grounded areas such as Arts and Humanities and Law and Governance. All data, code, and evaluation artifacts are publicly released.

📝 Abstract
Evaluating Large Language Models (LLMs) in low-resource and linguistically diverse languages remains a significant challenge in NLP, particularly for languages using non-Latin scripts like those spoken in India. Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing LLM capabilities in these languages. We introduce MILU, a Multi-task Indic Language Understanding Benchmark, a comprehensive evaluation benchmark designed to address this gap. MILU spans 8 domains and 41 subjects across 11 Indic languages, reflecting both general and culturally specific knowledge. With an India-centric design, MILU incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics. We evaluate over 42 LLMs and find that current LLMs struggle with MILU, with GPT-4o achieving the highest average accuracy at 74 percent. Open multilingual models outperform language-specific fine-tuned models, which perform only slightly better than random baselines. Models also perform better in high-resource languages than in low-resource ones. Domain-wise analysis indicates that models perform poorly in culturally relevant areas like Arts and Humanities or Law and Governance compared to general fields like STEM. To the best of our knowledge, MILU is the first-of-its-kind benchmark focused on Indic languages, serving as a crucial step towards comprehensive cultural evaluation. All code, benchmarks, and artifacts are publicly available to foster open research.
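A minimal sketch of how MILU-style items might be scored, mirroring the paper's per-language and per-domain accuracy analyses: each multiple-choice question is formatted into a prompt, the model's predicted option letter is compared with the gold answer, and results are aggregated by language and domain. The item field names and the answer_fn stub below are illustrative assumptions, not the authors' released evaluation harness.

from collections import defaultdict

def evaluate(items, answer_fn):
    """Score MCQ items; the field names ('language', 'domain', 'question',
    'options', 'answer') are assumed here for illustration."""
    per_lang = defaultdict(lambda: [0, 0])    # language -> [correct, total]
    per_domain = defaultdict(lambda: [0, 0])  # domain   -> [correct, total]
    for item in items:
        letters = "ABCD"[: len(item["options"])]
        prompt = (
            item["question"] + "\n"
            + "\n".join(f"{l}. {o}" for l, o in zip(letters, item["options"]))
            + "\nAnswer with the option letter."
        )
        pred = answer_fn(prompt).strip().upper()[:1]  # model's letter choice
        hit = int(pred == item["answer"])
        for key, table in ((item["language"], per_lang), (item["domain"], per_domain)):
            table[key][0] += hit
            table[key][1] += 1
    return ({k: c / n for k, (c, n) in per_lang.items()},
            {k: c / n for k, (c, n) in per_domain.items()})

# Tiny smoke test with a stub model that always answers "B".
demo = [{"language": "Hindi", "domain": "STEM", "question": "2 + 2 = ?",
         "options": ["3", "4", "5", "6"], "answer": "B"}]
print(evaluate(demo, lambda prompt: "B"))

Decomposing accuracy along these two axes is what surfaces the gaps the paper reports, e.g. STEM versus Arts and Humanities, and high- versus low-resource languages.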
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in low-resource Indic languages
Addressing gaps in non-Latin script language benchmarks
Assessing LLM performance in culturally specific knowledge domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

MILU, a multi-task benchmark spanning 8 domains and 41 subjects
Evaluation of 42 LLMs across 11 Indic languages
Coverage of culturally specific knowledge from regional and state-level examinations