The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora

📅 2025-07-10

📈 Citations: 0

✨ Influential: 0

career value

159K/year

🤖 AI Summary

This work identifies a significant retrieval bias in cross-lingual RAG for Arabic–English settings, arising from query–document language misalignment—particularly degrading performance on domain-specific tasks. To address this, we propose a forced bilingual balanced retrieval strategy that constrains the retrieval distribution in the cross-lingual embedding space to mitigate cross-lingual ranking bias. We construct the first Arabic–English domain-specific multilingual RAG benchmark using real enterprise data. Through controlled comparative experiments and cross-lingual embedding analysis, we systematically validate our approach: it preserves monolingual retrieval performance while improving cross-lingual recall by up to 23.6% and significantly enhancing end-to-end generation quality (measured by BLEU and ROUGE). This study establishes a reproducible paradigm and practical pathway for improving fairness and robustness in multilingual RAG systems.

Technology Category

Application Category

📝 Abstract

Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior. Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with significant performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever's difficulty in ranking documents across languages. Finally, we propose a simple retrieval strategy that addresses this source of failure by enforcing equal retrieval from both languages, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.

Problem

Research questions and friction points this paper is trying to address.

Study cross-lingual retrieval biases in Arabic-English RAG

Analyze performance drops in domain-specific multilingual retrieval

Propose strategy to improve cross-lingual document ranking

Innovation

Methods, ideas, or system contributions that make the work stand out.

Arabic-English RAG in domain-specific setting

Equal retrieval from both languages strategy

Real-world corporate datasets benchmarks

🔎 Similar Papers

Faux Polyglot: A Study on Information Disparity in Multilingual Large Language Models