🤖 AI Summary
Existing large language models (LLMs) lack fine-grained, high-quality evaluation benchmarks for low-resource Indian languages (e.g., Nepali, Gujarati). Method: We introduce IndicParam, the first multidimensional benchmark covering 11 low- and extremely low-resource Indian languages, comprising over 13,000 human-annotated multiple-choice questions. It features a novel decoupled evaluation framework distinguishing *knowledge competence* from *linguistic competence*, and systematically incorporates challenging task types, including code-mixed inputs, list matching, causal reasoning, and sequence ordering. Contribution/Results: Evaluating 19 state-of-the-art LLMs reveals that even GPT-5 achieves only 45.0% average accuracy, exposing critical limitations in cross-lingual knowledge transfer. IndicParam provides a reproducible, extensible, standardized assessment toolkit for low-resource multilingual evaluation, advancing joint research on linguistic understanding and knowledge generalization.
📝 Abstract
While large language models excel on high-resource multilingual tasks, low- and extremely low-resource Indic languages remain severely under-evaluated. We present IndicParam, a human-curated benchmark of over 13,000 multiple-choice questions covering 11 such languages (Nepali, Gujarati, Marathi, and Odia as low-resource; Dogri, Maithili, Rajasthani, Sanskrit, Bodo, Santali, and Konkani as extremely low-resource), plus a Sanskrit-English code-mixed set. Our evaluation of 19 LLMs, both proprietary and open-weights, reveals that even the top-performing GPT-5 reaches only 45.0% average accuracy, followed by DeepSeek-3.2 (43.1%) and Claude-4.5 (42.7%). We additionally label each question as knowledge-oriented or purely linguistic to discriminate factual recall from grammatical proficiency. Further, we assess the ability of LLMs to handle diverse question formats, such as list-based matching, assertion-reason pairs, and sequence ordering, alongside conventional multiple-choice questions. IndicParam provides insights into the limitations of cross-lingual transfer and establishes a challenging benchmark for Indic languages. The dataset is available at https://huggingface.co/datasets/bharatgenai/IndicParam, and scripts to run the benchmark are available at https://github.com/ayushbits/IndicParam.
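The knowledge-versus-linguistic labeling described above amounts to scoring model predictions separately per competence label. A minimal sketch of that decoupled scoring is below; the field names (`label`, `prediction`, `answer`) and label values are illustrative assumptions, not the dataset's actual schema:

```python
from collections import defaultdict

def decoupled_accuracy(records):
    """Compute accuracy separately for each competence label.

    Each record is a dict with a 'label' ('knowledge' or 'linguistic'),
    the model's 'prediction', and the gold 'answer'. Field names here
    are hypothetical, chosen only to illustrate the idea.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["label"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["label"]] += 1
    # Per-label accuracy: factual recall vs. grammatical proficiency.
    return {label: correct[label] / total[label] for label in total}

# Toy run: two knowledge questions (one answered correctly),
# one linguistic question (answered correctly).
records = [
    {"label": "knowledge", "prediction": "B", "answer": "B"},
    {"label": "knowledge", "prediction": "A", "answer": "C"},
    {"label": "linguistic", "prediction": "D", "answer": "D"},
]
print(decoupled_accuracy(records))  # {'knowledge': 0.5, 'linguistic': 1.0}
```

Reporting the two accuracies separately, rather than a single pooled score, is what lets the benchmark distinguish a model that knows facts but handles the language poorly from one with the opposite profile.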