MessIRve: A Large-Scale Spanish Information Retrieval Dataset

📅 2024-09-09

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Spanish information retrieval (IR) has lacked large-scale, regionally diverse datasets built from authentic user queries. This work introduces MessIRve, a Spanish IR dataset of approximately 730,000 queries sourced from Google's autocomplete API, each paired with relevant Wikipedia documents. The queries reflect multiple Spanish-speaking regions and, owing to the dataset's size, cover a wide range of topics. Unlike datasets translated from English, MessIRve is built from native Spanish queries and accounts for dialectal variation. The paper describes the dataset in detail, compares it with existing datasets, and reports baseline evaluations of prominent IR models.

📝 Abstract

Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, current IR benchmarks lack Spanish data, hindering the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with around 730 thousand queries from Google's autocomplete API and relevant documents sourced from Wikipedia. MessIRve's queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.
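To make the retrieval task and the notion of a baseline evaluation concrete, here is a minimal sketch that ranks a toy collection with BM25 and scores the ranking with recall@k. It is an illustration only: the abstract does not say which models or metrics were used, BM25 is chosen here simply as a common lexical baseline, and the documents, query, and function names (`bm25_rank`, `recall_at_k`) are hypothetical rather than taken from MessIRve.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch);
    # a real pipeline would handle punctuation and accents more carefully.
    return text.lower().split()


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document ids sorted by BM25 score, highest first."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    # Break ties by document id so the ordering is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents found in the top-k results."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


# Toy collection and query, invented for illustration (not MessIRve data).
docs = {
    "d1": "el mate es una infusión tradicional de argentina y uruguay",
    "d2": "la paella es un plato de arroz originario de valencia",
    "d3": "el tango es un género musical y una danza del río de la plata",
}
ranked = bm25_rank("qué es el mate", docs)
print(ranked, recall_at_k(ranked, {"d1"}, k=1))
```

On a real dataset the same loop would run over every query and its annotated relevant documents, with the per-query scores averaged.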

Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of Spanish information retrieval datasets

Providing a large-scale dataset with diverse regional Spanish queries

Enabling development of information access tools for Spanish speakers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Created large-scale Spanish IR dataset from native queries

Sourced diverse regional queries via Google autocomplete API

Provided Wikipedia documents and baseline IR model evaluations
