🤖 AI Summary
This work addresses the geopolitical bias that multilingual RAG systems introduce in culturally sensitive tasks such as territorial disputes, and proposes the first evaluation framework tailored to such scenarios. We introduce BordIRLines, a benchmark comprising 720 contested-territory queries and 14k Wikipedia documents spanning 49 languages. The methodology employs a cross-lingual retrieval, reranking, and RAG pipeline and defines novel bias quantification metrics: response consistency, citation language entropy, and stance bias score. Experiments reveal that retrieving multilingual documents reduces geopolitical bias and improves response consistency in mainstream LLMs compared with purely in-language retrieval; however, queries in low-resource languages exhibit markedly wider variance in the citation language distribution. To foster reproducibility and advance research on multilingual information fairness, the benchmark dataset and implementation code are publicly released.
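The summary names the metrics but does not define them. As a hedged illustration (not the paper's actual definition), citation language entropy could be computed as the Shannon entropy of the language distribution over a response's cited documents:

```python
from collections import Counter
import math

def citation_language_entropy(cited_langs):
    """Shannon entropy (in bits) of the language distribution over a
    response's cited documents. Higher entropy means citations are
    spread more evenly across languages; 0 means a single language.

    NOTE: illustrative sketch only; the paper's exact metric
    definition may differ.
    """
    counts = Counter(cited_langs)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Citations spread evenly over 4 languages -> 2.0 bits
print(citation_language_entropy(["en", "zh", "hi", "ur"]))
# Citations in English only -> 0.0 bits
print(citation_language_entropy(["en", "en", "en"]))
```

Under this reading, a response that cites only documents in the query language would score 0, while one drawing evenly on many languages would score higher, matching the summary's concern about imbalanced citation distributions.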
📝 Abstract
The paradigm of retrieval-augmented generation (RAG) helps mitigate hallucinations of large language models (LLMs). However, RAG also introduces biases contained within the retrieved documents. These biases can be amplified in multilingual and culturally sensitive scenarios, such as territorial disputes. In this paper, we introduce BordIRLines, a benchmark consisting of 720 territorial dispute queries paired with 14k Wikipedia documents across 49 languages. To evaluate LLMs' cross-lingual robustness on this task, we formalize several modes of multilingual retrieval. Our experiments on several LLMs reveal that retrieving multilingual documents improves response consistency and decreases geopolitical bias more than using purely in-language documents, showing how incorporating diverse perspectives improves robustness. We also find that querying in low-resource languages produces much wider variance in the linguistic distribution of response citations. Further experiments and case studies investigate how cross-lingual RAG is affected by factors ranging from IR to document contents. We release our benchmark and code to support further research towards ensuring equitable information access across languages at https://huggingface.co/datasets/borderlines/bordirlines.