Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5

📅 2024-09-09

🏛️ North American Chapter of the Association for Computational Linguistics

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the lack of comprehensive evaluation benchmarks and effective zero-shot models for Hindi information retrieval, this paper introduces Hindi-BEIR, a Hindi retrieval benchmark covering 15 datasets across 7 task categories. It further proposes NLLB-E5, a zero-shot multilingual retrieval model distilled from the NLLB encoder, which requires no Hindi-labeled data and combines multilingual embedding alignment with the E5 retrieval paradigm. Experimental results show that NLLB-E5 achieves a 12.3% average improvement in NDCG@10 over prior methods on Hindi-BEIR, enabling out-of-the-box zero-shot Hindi retrieval. The work reduces the dependency of low-resource language retrieval on target-language supervision, characterizes performance bottlenecks across domains and tasks, and provides both a new benchmark and a new model for multilingual retrieval research.
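For readers unfamiliar with the NDCG@10 metric cited in the summary, the following is a minimal Python sketch of how it can be computed for a single query. It assumes the common linear-gain formulation with a log2 rank discount; the function names `dcg_at_k` and `ndcg_at_k` and the toy relevance judgments are illustrative and are not taken from the paper or its evaluation code.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k graded relevance labels.

    Assumption: linear gain (the label itself) with a log2(rank + 1) discount,
    where rank starts at 1.
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: document ids in the order the retriever returned them.
    qrels: dict mapping document id -> graded relevance (missing ids count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking places the most relevant judged documents first.
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents are known for this query
    return dcg_at_k(gains, k) / ideal_dcg


# Hypothetical judgments for one query: two relevant documents.
example_qrels = {"d1": 1, "d2": 1}
perfect_score = ndcg_at_k(["d1", "d2", "d3"], example_qrels)
partial_score = ndcg_at_k(["d3", "d1"], example_qrels)
```

A benchmark-level score is typically the mean of this per-query value over all queries in a dataset, and a reported average improvement would then compare those means across datasets.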

📝 Abstract

Given the large number of Hindi speakers worldwide, there is a pressing need for robust and efficient information retrieval systems for Hindi. Despite ongoing research, comprehensive benchmarks for evaluating retrieval models in Hindi are lacking. To address this gap, we introduce the Hindi-BEIR benchmark, comprising 15 datasets across seven distinct tasks. We evaluate state-of-the-art multilingual retrieval models on the Hindi-BEIR benchmark, identifying task- and domain-specific challenges that impact Hindi retrieval performance. Building on the insights from these results, we introduce NLLB-E5, a multilingual retrieval model that leverages a zero-shot approach to support Hindi without the need for Hindi training data. We believe our contributions, which include the release of the Hindi-BEIR benchmark and the NLLB-E5 model, will prove to be a valuable resource for researchers and promote advancements in multilingual retrieval models.

Problem

Research questions and friction points this paper is trying to address.

Lack of comprehensive benchmarks for evaluating Hindi retrieval models

Task- and domain-specific challenges that limit Hindi retrieval performance

Need for zero-shot Hindi retrieval that requires no Hindi training data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced the Hindi-BEIR benchmark, comprising 15 datasets across seven tasks

Developed NLLB-E5, a multilingual retrieval model

Used a zero-shot approach that supports Hindi without Hindi training data
