AI Summary
To address the lack of an authoritative evaluation benchmark and of efficient zero-shot models for Hindi information retrieval, this paper introduces Hindi-BEIR, the first comprehensive Hindi retrieval benchmark, covering 15 datasets across 7 task categories. It further proposes NLLB-E5, a zero-shot multilingual retrieval model distilled from the NLLB encoder, which requires no Hindi-labeled data and combines multilingual embedding alignment with the E5 retrieval paradigm. Experimental results show that NLLB-E5 achieves a 12.3% average improvement in NDCG@10 over prior methods on Hindi-BEIR, enabling high-performance, out-of-the-box zero-shot Hindi retrieval for the first time. The work breaks the long-standing dependency of low-resource language retrieval on target-language supervision, systematically characterizes performance bottlenecks across domains and tasks, and establishes both a new benchmark and a new paradigm for multilingual retrieval research.
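The summary reports results in NDCG@10, so a minimal sketch of how that metric can be computed for a single query may help. This is an illustration under stated assumptions, not the paper's evaluation code: the function names `dcg_at_k` and `ndcg_at_k` are hypothetical, and the linear-gain form of DCG is assumed (some toolkits use an exponential gain, which is identical for binary relevance labels).

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels.

    Assumes linear gain: each label is divided by log2(rank + 1),
    with ranks starting at 1.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: document ids in the order the retriever returned them.
    qrels: dict mapping document id -> graded relevance (missing ids count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        # No relevant documents are known for this query.
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg
```

A benchmark-level score would typically average this per-query value over all queries in a dataset, and then over datasets.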
Abstract
Given the large number of Hindi speakers worldwide, there is a pressing need for robust and efficient information retrieval systems for Hindi. Despite ongoing research, comprehensive benchmarks for evaluating retrieval models in Hindi are lacking. To address this gap, we introduce the Hindi-BEIR benchmark, comprising 15 datasets across seven distinct tasks. We evaluate state-of-the-art multilingual retrieval models on the Hindi-BEIR benchmark, identifying task- and domain-specific challenges that impact Hindi retrieval performance. Building on the insights from these results, we introduce NLLB-E5, a multilingual retrieval model that leverages a zero-shot approach to support Hindi without the need for Hindi training data. We believe our contributions, which include the release of the Hindi-BEIR benchmark and the NLLB-E5 model, will prove to be a valuable resource for researchers and promote advancements in multilingual retrieval models.