🤖 AI Summary
Evaluation of Hindi large language models (LLMs) has long been hindered by the absence of high-quality, linguistically and culturally grounded benchmarks; direct translation of English datasets fails to preserve Hindi's syntactic structures and sociocultural context. Method: We propose a reusable benchmark-construction methodology for low-resource languages, combining from-scratch human-in-the-loop annotation with a dual-track translate-and-verify pipeline, reinforced by multi-stage quality control. Contribution/Results: This yields the first comprehensive Hindi evaluation suite, comprising IFEval-Hi (instruction following), MT-Bench-Hi (dialogue), GSM8K-Hi (mathematical reasoning), ChatRAG-Hi (retrieval-augmented generation), and BFCL-Hi (function calling). Using this suite, we conduct the first systematic evaluation of mainstream open-source LLMs that support Hindi, uncovering cross-task strengths and performance bottlenecks while substantially mitigating linguistic and cultural distortion. Our work establishes a methodological paradigm and foundational infrastructure for AI evaluation in low-resource languages.
📝 Abstract
Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We leverage this suite to conduct extensive benchmarking of open-source LLMs that support Hindi, providing a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages.
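
To make the dual-track construction concrete, here is a minimal sketch of how a translate-and-verify pipeline with a from-scratch human-annotation fallback might be wired up. All names (`Item`, `build_item`, the injected callables) and the pass/fail routing rule are illustrative assumptions, not the authors' implementation; the paper's actual multi-stage quality control is not reproduced here.

```python
# Minimal sketch (hypothetical): dual-track "translate-and-verify" with a
# from-scratch human-annotation fallback. Names are illustrative assumptions,
# not the paper's actual implementation.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Item:
    source_en: str               # original English benchmark item
    hindi: Optional[str] = None  # accepted Hindi version
    track: str = ""              # "translated" or "human-authored"

def build_item(
    en_text: str,
    translate: Callable[[str], str],            # e.g., an MT system
    verify: Callable[[str, str], bool],         # human/LLM adequacy check
    author_from_scratch: Callable[[str], str],  # human annotator fallback
) -> Item:
    """Track 1: machine-translate, then verify linguistic/cultural adequacy.
    Track 2: if verification fails, route to from-scratch human annotation."""
    candidate = translate(en_text)
    if verify(en_text, candidate):
        return Item(en_text, candidate, track="translated")
    return Item(en_text, author_from_scratch(en_text), track="human-authored")

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end.
    demo = build_item(
        "What is 7 + 5?",
        translate=lambda s: "7 + 5 kitna hota hai?",  # placeholder MT output
        verify=lambda en, hi: bool(hi.strip()),       # placeholder check
        author_from_scratch=lambda s: "(human-written Hindi item)",
    )
    print(demo.track, "->", demo.hindi)
```

Injecting the translator, verifier, and annotator as callables keeps the routing logic testable and lets the same skeleton serve all five datasets, whatever mix of machine translation and from-scratch annotation each one needs.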