XRAG: Cross-lingual Retrieval-Augmented Generation

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of evaluating LLM generation capabilities in cross-lingual Retrieval-Augmented Generation (RAG), where user queries and retrieved documents differ in language. To this end, we introduce XRAG, the first dedicated benchmark for cross-lingual RAG evaluation. XRAG covers realistic monolingual and multilingual retrieval scenarios, provides fine-grained relevance annotations, and includes questions that require external knowledge and complex reasoning. We propose a comprehensive evaluation paradigm that integrates automated question generation from multilingual news corpora, human-curated multilingual relevance annotations, simulated cross-lingual retrieval, and dual-dimensional assessment of language consistency and cross-lingual reasoning. Our analysis uncovers two previously overlooked challenges: high response-language mismatch rates under monolingual retrieval, and weak cross-lingual reasoning, particularly in non-English generation, under multilingual retrieval. An evaluation of five state-of-the-art LLMs reveals significant human-machine performance gaps and quantitatively pinpoints these bottlenecks, making XRAG a reproducible, modular diagnostic benchmark.
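The language-consistency dimension of the assessment can be sketched as follows. Note that `detect_script` is a toy Unicode-range heuristic and the example data are my own illustrations, not the paper's actual implementation; a real evaluation would use a proper language-identification model.

```python
from collections import Counter

def detect_script(text: str) -> str:
    """Toy language identifier based on Unicode script ranges.
    A stand-in for a real language-ID model; it cannot distinguish
    languages that share a script (e.g. English vs. German)."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        if 0x4E00 <= cp <= 0x9FFF:      # CJK Unified Ideographs
            counts["zh"] += 1
        elif 0x0400 <= cp <= 0x04FF:    # Cyrillic
            counts["ru"] += 1
        elif ch.isalpha():              # everything else alphabetic
            counts["latin"] += 1
    return counts.most_common(1)[0][0] if counts else "unknown"

def language_consistency_rate(examples) -> float:
    """Fraction of responses whose language matches the query language."""
    matches = sum(
        1 for ex in examples
        if detect_script(ex["response"]) == detect_script(ex["query"])
    )
    return matches / len(examples)

# Illustrative cases: a Russian query answered in English is a
# response-language mismatch, the kind XRAG measures.
examples = [
    {"query": "谁赢得了选举？", "response": "候选人X赢得了选举。"},
    {"query": "Кто победил на выборах?", "response": "Party X won the election."},
]
print(language_consistency_rate(examples))  # 0.5
```

The cross-lingual reasoning dimension would be scored separately, e.g. by answer correctness against gold references.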

📝 Abstract
We propose XRAG, a novel benchmark designed to evaluate the generation abilities of LLMs in cross-lingual Retrieval-Augmented Generation (RAG) settings where the user language does not match the retrieval results. XRAG is constructed from recent news articles to ensure that its questions require external knowledge to be answered. It covers the real-world scenarios of monolingual and multilingual retrieval, and provides relevancy annotations for each retrieved document. Our novel dataset construction pipeline results in questions that require complex reasoning, as evidenced by the significant gap between human and LLM performance. Consequently, XRAG serves as a valuable benchmark for studying LLM reasoning abilities, even before considering the additional cross-lingual complexity. Experimental results on five LLMs uncover two previously unreported challenges in cross-lingual RAG: 1) in the monolingual retrieval setting, all evaluated models struggle with response language correctness; 2) in the multilingual retrieval setting, the main challenge lies in reasoning over retrieved information across languages rather than generation of non-English text.
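A minimal sketch of the setting the abstract describes: the prompt combines a query in one language with retrieved evidence in another. The prompt template and the example query/document are my own illustrations, not taken from the paper.

```python
def build_rag_prompt(query: str, docs: list[str]) -> str:
    """Assemble a RAG prompt where retrieved passages may be in a
    different language from the user query (cross-lingual setting)."""
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    return (
        "Answer in the language of the question.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

# German query, English-only retrieved evidence: the model must both
# reason over the foreign-language context and respond in German.
prompt = build_rag_prompt(
    "Wer gewann die Wahl 2024?",
    ["The 2024 election was won by the incumbent party."],
)
print(prompt)
```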
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' cross-lingual RAG generation abilities
Assessing monolingual and multilingual retrieval relevancy challenges
Identifying reasoning gaps in cross-lingual information synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual RAG benchmark for LLM evaluation
Monolingual and multilingual retrieval scenarios
Complex reasoning questions from news articles