🤖 AI Summary
Evaluation of Hindi large language models (LLMs) has long been hindered by the absence of high-quality, linguistically and culturally grounded benchmarks; direct translation of English datasets fails to preserve Hindi's syntactic structures and sociocultural context. Method: We propose a reusable benchmark-construction methodology for low-resource languages, combining from-scratch human-in-the-loop annotation with a dual-track translate-and-verify pipeline, reinforced by multi-stage quality control. Contribution/Results: This yields the first comprehensive Hindi evaluation suite, comprising IFEval-Hi (instruction following), MT-Bench-Hi (dialogue), GSM8K-Hi (mathematical reasoning), ChatRAG-Hi (retrieval-augmented generation), and BFCL-Hi (function calling). Using this suite, we conduct the first systematic evaluation of mainstream open-source LLMs that support Hindi, uncovering cross-task strengths and performance bottlenecks while substantially mitigating linguistic and cultural distortion. Our work establishes a methodological paradigm and foundational infrastructure for AI evaluation in low-resource languages.
📝 Abstract
Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We leverage this suite to conduct extensive benchmarking of open-source LLMs that support Hindi, providing a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages.
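
To make the dual-track construction concrete, here is a minimal sketch of how a translate-and-verify pipeline with a from-scratch human-annotation fallback might be wired up. All names (`Item`, `build_item`, the injected callables) and the pass/fail routing rule are illustrative assumptions, not the authors' implementation; the paper's actual multi-stage quality control is not reproduced here.

```python
# Minimal sketch (hypothetical): dual-track "translate-and-verify" with a
# from-scratch human-annotation fallback. Names are illustrative assumptions,
# not the paper's actual implementation.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Item:
    source_en: str               # original English benchmark item
    hindi: Optional[str] = None  # accepted Hindi version
    track: str = ""              # "translated" or "human-authored"

def build_item(
    en_text: str,
    translate: Callable[[str], str],            # e.g., an MT system
    verify: Callable[[str, str], bool],         # human/LLM adequacy check
    author_from_scratch: Callable[[str], str],  # human annotator fallback
) -> Item:
    """Track 1: machine-translate, then verify linguistic/cultural adequacy.
    Track 2: if verification fails, route to from-scratch human annotation."""
    candidate = translate(en_text)
    if verify(en_text, candidate):
        return Item(en_text, candidate, track="translated")
    return Item(en_text, author_from_scratch(en_text), track="human-authored")

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end.
    demo = build_item(
        "What is 7 + 5?",
        translate=lambda s: "7 + 5 kitna hota hai?",  # placeholder MT output
        verify=lambda en, hi: bool(hi.strip()),       # placeholder check
        author_from_scratch=lambda s: "(human-written Hindi item)",
    )
    print(demo.track, "->", demo.hindi)
```

Injecting the translator, verifier, and annotator as callables keeps the routing logic testable and lets the same skeleton serve all five datasets, whatever mix of machine translation and from-scratch annotation each one needs.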