Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Tasks

📅 2025-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses insufficient cross-lingual coverage and semantic misalignment in retrieval-augmented generation (RAG) for multilingual open-domain question answering. We propose CrossRAG, a cross-lingual RAG framework that translates multilingual retrieval results into a single common language (e.g., English) before answer generation. CrossRAG is presented as the first to systematically integrate machine translation, multilingual retrieval (using mBERT/XTREME), cross-lingual embedding alignment, and LLM-based conditional generation, thereby overcoming key limitations of monolingual RAG. Experiments on XQuAD, MLQA, and other benchmarks demonstrate an average +8.2 F1 improvement; gains are most pronounced for low-resource languages (e.g., Swahili, Bengali), while inference efficiency surpasses full-translation pipelines. The core contribution lies in establishing cross-lingual document alignment as a critical factor for generation quality, and in positioning CrossRAG as a new paradigm for multilingual, knowledge-intensive NLP tasks.
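The multilingual retrieval step summarized above depends on a shared cross-lingual embedding space, in which a question in one language can score documents written in another. A minimal sketch of that idea follows; the toy vectors stand in for real multilingual encoder (e.g., mBERT) embeddings, and all document IDs and scores are illustrative assumptions, not values from the paper:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_vec, doc_index, k=2):
    # Rank documents from any language by similarity in the shared space.
    ranked = sorted(doc_index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy "aligned" embeddings: a document and its translation land near
# each other, so one query can surface evidence in several languages.
doc_index = {
    "en:capital-of-kenya":  [0.90, 0.10, 0.00],
    "sw:mji-mkuu-wa-kenya": [0.88, 0.12, 0.01],  # Swahili version, nearby
    "en:unrelated-sports":  [0.00, 0.20, 0.95],
}
query = [0.92, 0.09, 0.0]  # e.g. "What is the capital of Kenya?"
print(retrieve(query, doc_index))
# → ['en:capital-of-kenya', 'sw:mji-mkuu-wa-kenya']
```

The point of the sketch is only the alignment property: because the English and Swahili versions of the same fact embed close together, both are retrieved for one query, which is the cross-lingual coverage CrossRAG then normalizes via translation.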

📝 Abstract
Retrieval-augmented generation (RAG) has become a cornerstone of contemporary NLP, enhancing large language models (LLMs) by allowing them to access richer factual contexts through in-context retrieval. While effective in monolingual settings, especially in English, its use in multilingual tasks remains unexplored. This paper investigates the effectiveness of RAG across multiple languages by proposing novel approaches for multilingual open-domain question-answering. We evaluate the performance of various multilingual RAG strategies, including question-translation (tRAG), which translates questions into English before retrieval, and Multilingual RAG (MultiRAG), where retrieval occurs directly across multiple languages. Our findings reveal that tRAG, while useful, suffers from limited coverage. In contrast, MultiRAG improves efficiency by enabling multilingual retrieval but introduces inconsistencies due to cross-lingual variations in the retrieved content. To address these issues, we propose Crosslingual RAG (CrossRAG), a method that translates retrieved documents into a common language (e.g., English) before generating the response. Our experiments show that CrossRAG significantly enhances performance on knowledge-intensive tasks, benefiting both high-resource and low-resource languages.
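The three strategies compared in the abstract differ only in where translation sits in the pipeline: tRAG translates the question before retrieval, MultiRAG skips translation and retrieves across languages, and CrossRAG retrieves across languages and then translates the retrieved documents. A schematic sketch of that difference, in which every function (`translate`, `retrieve`, `generate`) is a hypothetical stand-in rather than the paper's actual components:

```python
def translate(text, target="en"):
    # Stand-in for a machine-translation system; just tags the text.
    return f"{target}({text})"

def retrieve(query, multilingual=False):
    # Stand-in retriever: returns one pseudo-document per language.
    langs = ["en", "sw", "bn"] if multilingual else ["en"]
    return [f"doc[{lang}|{query}]" for lang in langs]

def generate(question, context):
    # Stand-in LLM: echoes the question and the context it saw.
    return f"answer({question}; {' + '.join(context)})"

def tRAG(question):
    # Translate the question to English, then retrieve English docs only.
    q_en = translate(question)
    return generate(q_en, retrieve(q_en))

def multiRAG(question):
    # Retrieve directly across languages; the context stays mixed-language.
    return generate(question, retrieve(question, multilingual=True))

def crossRAG(question):
    # Retrieve across languages, then translate every retrieved document
    # into one common language before conditioning generation on it.
    docs = retrieve(question, multilingual=True)
    return generate(question, [translate(d) for d in docs])
```

Tracing `crossRAG("Mji mkuu wa Kenya ni upi?")` shows generation conditioned on an all-English context, which is the consistency gain the abstract attributes to CrossRAG over MultiRAG's mixed-language context.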
Problem

Research questions and friction points this paper is trying to address.

Evaluates RAG effectiveness in multilingual question-answering tasks
Compares tRAG and MultiRAG strategies for multilingual retrieval
Proposes CrossRAG to improve cross-lingual consistency and performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual RAG (MultiRAG) retrieves directly across languages for cross-language question answering
Question-translation RAG (tRAG) routes queries through English to leverage English retrieval
CrossRAG translates retrieved documents into a common language for consistent responses