IndicParam: Benchmark to evaluate LLMs on low-resource Indic Languages

📅 2025-11-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing large language models (LLMs) lack fine-grained, high-quality evaluation benchmarks for low-resource Indian languages (e.g., Nepali, Gujarati). Method: We introduce IndicParam, the first multidimensional benchmark covering 11 low- and extremely low-resource Indian languages, comprising over 13,000 human-annotated multiple-choice questions. It features a novel decoupled evaluation framework that distinguishes *knowledge competence* from *linguistic competence*, and systematically incorporates challenging task types, including code-mixed inputs, list matching, causal reasoning, and sequence ordering. Contribution/Results: Evaluating 19 state-of-the-art LLMs reveals that even GPT-5 achieves only 45.0% average accuracy, exposing critical limitations in cross-lingual knowledge transfer. IndicParam provides a reproducible, extensible, standardized assessment toolkit for low-resource multilingual evaluation, advancing joint research on linguistic understanding and knowledge generalization.

📝 Abstract
While large language models excel on high-resource multilingual tasks, low- and extremely low-resource Indic languages remain severely under-evaluated. We present IndicParam, a human-curated benchmark of over 13,000 multiple-choice questions covering 11 such languages (Nepali, Gujarati, Marathi, Odia as low-resource; Dogri, Maithili, Rajasthani, Sanskrit, Bodo, Santali, Konkani as extremely low-resource), plus a Sanskrit-English code-mixed set. Evaluation of 19 LLMs, both proprietary and open-weight, reveals that even the top-performing GPT-5 reaches only 45.0% average accuracy, followed by DeepSeek-3.2 (43.1%) and Claude-4.5 (42.7%). We additionally label each question as knowledge-oriented or purely linguistic to discriminate factual recall from grammatical proficiency. Further, we assess the ability of LLMs to handle diverse question formats, such as list-based matching, assertion-reason pairs, and sequence ordering, alongside conventional multiple-choice questions. IndicParam provides insights into the limitations of cross-lingual transfer and establishes a challenging benchmark for Indic languages. The dataset is available at https://huggingface.co/datasets/bharatgenai/IndicParam. Scripts to run the benchmark are available at https://github.com/ayushbits/IndicParam.
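The decoupled evaluation described above can be sketched as a simple per-category accuracy computation over multiple-choice predictions. The record schema below (`answer` and `category` fields, with categories "knowledge" and "linguistic") is an assumption for illustration, not the dataset's actual field names; the official evaluation scripts live in the GitHub repository linked above.

```python
from collections import defaultdict

def decoupled_accuracy(records, predictions):
    """Compute overall and per-category accuracy for MCQ records.

    `records`: list of dicts with hypothetical fields `answer` (gold
    option label) and `category` ("knowledge" or "linguistic").
    `predictions`: parallel list of predicted option labels.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec, pred in zip(records, predictions):
        cat = rec["category"]
        total[cat] += 1
        total["overall"] += 1
        if pred == rec["answer"]:
            correct[cat] += 1
            correct["overall"] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy illustration (not real IndicParam data):
records = [
    {"answer": "B", "category": "knowledge"},
    {"answer": "C", "category": "linguistic"},
    {"answer": "A", "category": "knowledge"},
]
preds = ["B", "C", "D"]
print(decoupled_accuracy(records, preds))
```

Splitting accuracy this way surfaces whether a model's failures stem from missing facts or from weak command of the language itself, which is the distinction the knowledge-oriented/purely-linguistic labels are designed to expose.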
Problem

Research questions and friction points this paper is trying to address.

Low- and extremely low-resource Indic languages lack fine-grained evaluation benchmarks for LLMs.
Factual recall and grammatical proficiency need to be assessed separately, across diverse question formats.
Cross-lingual knowledge transfer in current LLMs is limited, as low accuracy scores on these languages reveal.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-curated benchmark of 13,000+ questions across 11 low-resource Indic languages
Decoupled evaluation of knowledge competence and linguistic proficiency
Coverage of diverse question formats beyond conventional multiple-choice
Ayush Maheshwari
Sr. Solutions Architect, Nvidia
Machine learning · NLP · LLM
Kaushal Sharma
Indian Institute of Management Indore, India ; BharatGen
Vivek Patel
Indian Institute of Management Indore, India ; BharatGen
Aditya Maheshwari
Indian Institute of Management Indore, India ; BharatGen