CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents

📅 2025-04-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses cross-lingual information retrieval (CLIR) for English queries against French academic literature. It introduces CLIRudit, a new zero-shot English–French CLIR benchmark constructed from bilingual article metadata on the Érudit publishing platform. Methodologically, the authors systematically evaluate zero-shot first-stage retrievers, both dense (mBERT, XLM-R, bge-multilingual) and sparse (BM25, SPLADE), comparing query-side and document-side machine translation (MT) against multilingual embedding strategies. Key findings: large dense models, even without cross-lingual fine-tuning, match baselines built on ground-truth human translations in the zero-shot setting; sparse retrievers combined with document-side MT are both efficient and competitive. The dataset, evaluation code, and protocols are publicly released to support reproducible, multi-domain, multilingual CLIR research.

📝 Abstract
Cross-lingual information retrieval (CLIR) consists in finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers.
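The sparse baseline the abstract describes, BM25 run over machine-translated documents, can be sketched in a few lines of standard-library Python. The tokenized documents, query, and parameter values below are illustrative toys, not the paper's actual corpus or configuration:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency of each term across the collection.
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores

# English query matched against French documents that were
# machine-translated into English (document-side MT).
docs = [
    "cross lingual retrieval of scholarly articles".split(),
    "deep learning for image classification".split(),
]
query = "cross lingual scholarly search".split()
scores = bm25_scores(query, docs)
```

Because matching happens over exact terms, document-side translation is what lets the English query hit the (originally French) documents at all; dense retrievers sidestep this by embedding both languages into one space.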
Problem

Research questions and friction points this paper is trying to address.

Evaluate cross-lingual academic search with English queries over French documents
Benchmark zero-shot first-stage retrieval methods, with and without machine translation
Improve accessibility of scientific knowledge across language barriers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses bilingual metadata for dataset creation
Benchmarks dense and sparse retrieval methods
Shows large dense retrievers match human-translation baselines zero-shot, without machine translation
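The zero-shot dense retrieval idea behind these contributions is that a multilingual encoder maps the English query and the French documents into one shared vector space, so ranking reduces to vector similarity. A minimal cosine-similarity sketch, using toy vectors as stand-ins for real encoder outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: a multilingual encoder would place an English
# query near French documents on the same topic. Values are illustrative.
query_vec = [0.9, 0.1, 0.2]                    # English query
doc_vecs = {
    "fr_doc_on_topic": [0.85, 0.15, 0.25],     # French article, same topic
    "fr_doc_off_topic": [0.05, 0.90, 0.30],    # French article, unrelated
}
ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]),
                reverse=True)
```

No translation step appears anywhere: the cross-lingual matching is carried entirely by the shared embedding space, which is why strong multilingual dense models can rival human-translation baselines zero-shot.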