MessIRve: A Large-Scale Spanish Information Retrieval Dataset

📅 2024-09-09
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Spanish information retrieval (IR) has lacked large-scale datasets built from native queries that reflect regional variation. This work introduces MessIRve, a Spanish IR dataset of approximately 730,000 queries sourced from Google's autocomplete API, each paired with relevant Wikipedia documents. The queries come from diverse Spanish-speaking regions rather than being translated from English, and the dataset's size lets it cover a wide range of topics. The paper describes the dataset, compares it with existing datasets, and reports baseline evaluations of prominent IR models.

📝 Abstract
Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, current IR benchmarks lack Spanish data, hindering the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with around 730 thousand queries from Google's autocomplete API and relevant documents sourced from Wikipedia. MessIRve's queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.
Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of Spanish information retrieval datasets
Providing a large-scale dataset with diverse regional Spanish queries
Enabling development of information access tools for Spanish speakers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created large-scale Spanish IR dataset from native queries
Sourced diverse regional queries via Google autocomplete API
Provided Wikipedia documents and baseline IR model evaluations
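As a rough illustration of what a baseline evaluation on a query-to-document dataset like this involves, the sketch below computes mean Recall@k from relevance judgments. It is not the paper's evaluation code: the function names, the `qrels` and `run` dictionaries, and the toy document IDs are all hypothetical, and the paper's actual metrics and cutoffs are not stated in this summary.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    # Use a set so a document repeated in the ranking is counted once.
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int
) -> float:
    """Average Recall@k over all judged queries; unanswered queries score 0."""
    scores = [recall_at_k(run.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: query ID -> relevant document IDs (qrels)
# and query ID -> ranked document IDs returned by some retriever (run).
qrels = {"q1": {"d1"}, "q2": {"d7", "d9"}}
run = {"q1": ["d3", "d1", "d4"], "q2": ["d9", "d2", "d5"]}

print(mean_recall_at_k(run, qrels, k=2))
```

Any retriever that produces a ranked list per query, lexical or neural, can be plugged into `run` without changing the metric code.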
Authors

Francisco Valentini
PhD student, Applied Artificial Intelligence Lab., ICC, UBA-CONICET
Artificial Intelligence, Natural Language Processing, Machine Learning
Viviana Cotik
Universidad de Buenos Aires
artificial intelligence, natural language processing, machine learning, data quality, data mining
D. Furman
CONICET-Universidad de Buenos Aires. Instituto de Ciencias de la Computación (ICC). Buenos Aires, Argentina
Ivan Bercovich
University of California Santa Barbara
LLMs, Information Retrieval
E. Altszyler
Quantit
Juan Manuel Pérez
CONICET-Universidad de Buenos Aires. Instituto de Ciencias de la Computación (ICC). Buenos Aires, Argentina