🤖 AI Summary
To address the lack of evaluation benchmarks and large-scale multilingual training data for Retrieval-Augmented Generation (RAG) systems in Indian languages, this work introduces a large-scale RAG resource suite for Indian languages. It contributes (1) IndicMSMarco, a benchmark for evaluating retrieval quality and response generation in 13 Indian languages, built by manually translating 1,000 diverse queries from the MS MARCO dev set; and (2) a large-scale multilingual training dataset of (question, answer, relevant passage) tuples, derived with LLMs from the Wikipedias of 19 Indian languages and supplemented with translated versions of the original MS MARCO dataset. Construction therefore follows two paths: manual translation for the benchmark, and LLM-assisted extraction plus translation for the training data. Empirical results show improvements in retrieval accuracy and generation faithfulness, particularly for low-resource Indian languages. All resources are publicly released on Hugging Face.
📝 Abstract
Retrieval-Augmented Generation (RAG) systems enable language models to access relevant information and generate accurate, well-grounded, and contextually informed responses. However, for Indian languages, the development of high-quality RAG systems is hindered by the lack of two critical resources: (1) evaluation benchmarks for retrieval and generation tasks, and (2) large-scale training datasets for multilingual retrieval. Most existing benchmarks and datasets are centered on English or other high-resource languages, making it difficult to extend RAG capabilities to the diverse linguistic landscape of India. To address the lack of evaluation benchmarks, we introduce IndicMSMarco, a multilingual benchmark for evaluating retrieval quality and response generation in 13 Indian languages, built by manually translating 1,000 diverse queries from the MS MARCO dev set. To address the need for training data, we build a large-scale dataset of (question, answer, relevant passage) tuples derived from the Wikipedias of 19 Indian languages using state-of-the-art LLMs. Additionally, we include translated versions of the original MS MARCO dataset to further enrich the training data and ensure alignment with real-world information-seeking tasks. Resources are available here: https://huggingface.co/collections/ai4bharat/indicragsuite-683e7273cb2337208c8c0fcb
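To make the (question, answer, relevant passage) tuple format and the retrieval side of the benchmark concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' pipeline: the `RagExample` record, its field names, and the helper functions are hypothetical, and MRR@10 is used only because it is the customary MS MARCO passage-ranking metric; the abstract does not say which metrics IndicMSMarco reports.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RagExample:
    """Hypothetical record for one (question, answer, relevant passage) tuple."""
    language: str    # ISO 639-1 code such as "hi" or "ta" (assumed convention)
    question: str
    answer: str
    passage_id: str  # identifier of the passage that supports the answer


def reciprocal_rank(ranked_ids: List[str], relevant_id: str, k: int = 10) -> float:
    """Return 1/rank of the relevant passage within the top k, or 0.0 if absent."""
    for rank, passage_id in enumerate(ranked_ids[:k], start=1):
        if passage_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mrr_at_k(examples: List[RagExample],
             rankings: Dict[str, List[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank over examples; rankings maps question -> ranked passage ids."""
    if not examples:
        return 0.0
    total = sum(
        reciprocal_rank(rankings.get(ex.question, []), ex.passage_id, k)
        for ex in examples
    )
    return total / len(examples)
```

A retriever would be scored by passing its ranked passage ids per question to `mrr_at_k`, typically once per language so that weaker languages are not hidden by an overall average.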