Faux Polyglot: A Study on Information Disparity in Multilingual Large Language Models

📅 2024-07-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a systematic language bias in multilingual large language models (LLMs) used for cross-lingual retrieval-augmented generation (RAG) in information search. Models show a strong query-language preference, favoring documents written in the query's language, and, when no relevant information exists in that language, they over-rely on content in high-resource languages, marginalizing perspectives from low-resource languages. To investigate this, the authors construct a multilingual benchmark of factual and opinion-based queries and run cross-lingual retrieval evaluation, bias quantification, and controllable generation experiments within a unified RAG framework. The study reports that multilingual models (including mBERT, XLM-R, and Qwen2-MoE) consistently reproduce this bias across diverse language pairs. These findings challenge the implicit assumption that multilinguality inherently ensures informational equity and suggest that such biases may exacerbate global linguistic inequality.
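The "bias quantification" step described above can be made concrete with a simple metric: the fraction of retrieved documents whose language matches the query language. This is a minimal sketch, not code from the paper; the function name and data layout are illustrative assumptions.

```python
def query_language_preference(retrievals):
    """Fraction of retrieved documents whose language matches the query language.

    `retrievals` is a list of (query_lang, [doc_langs]) pairs. Under a
    language-balanced corpus, values near 1.0 indicate strong query-language
    bias in retrieval; values near the corpus language mix indicate parity.
    """
    matches = total = 0
    for query_lang, doc_langs in retrievals:
        matches += sum(1 for lang in doc_langs if lang == query_lang)
        total += len(doc_langs)
    return matches / total if total else 0.0

# Toy example: one English and one Hindi query over a mixed-language pool.
results = [
    ("en", ["en", "en", "de"]),  # 2 of 3 retrieved docs match the query language
    ("hi", ["hi", "en", "en"]),  # 1 of 3 match
]
print(query_language_preference(results))  # → 0.5
```

In practice this ratio would be compared against the language distribution of the underlying document pool, since a skewed pool alone can inflate the match rate without any model bias.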

📝 Abstract
Although the multilingual capability of LLMs offers new opportunities to overcome the language barrier, do these capabilities translate into real-life scenarios where linguistic divide and knowledge conflicts between multilingual sources are known occurrences? In this paper, we studied LLMs' linguistic preference in a cross-language RAG-based information search setting. We found that LLMs displayed systemic bias towards information in the same language as the query language in both document retrieval and answer generation. Furthermore, in scenarios where no information is in the language of the query, LLMs prefer documents in high-resource languages during generation, potentially reinforcing the dominant views. Such bias exists for both factual and opinion-based queries. Our results highlight the linguistic divide within multilingual LLMs in information search systems. The seemingly beneficial multilingual capability of LLMs may backfire on information parity by reinforcing language-specific information cocoons or filter bubbles, further marginalizing low-resource views.
Problem

Research questions and friction points this paper is trying to address.

LLMs exhibit systemic language bias
High-resource languages dominate information retrieval
Multilingual LLMs reinforce information disparity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-language RAG-based information search
LLMs display systemic language bias
Preference for high-resource language documents
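The cross-language RAG setting listed above can be sketched end to end: retrieve from a mixed-language pool, then assemble a generation prompt from the retrieved context. This is a minimal illustrative sketch, not the paper's system; the keyword-overlap scorer and the explicit `same_lang_boost` knob (which makes the measured query-language preference visible as a parameter) are assumptions for demonstration.

```python
def retrieve(query, docs, k=2, same_lang_boost=0.0):
    """Toy cross-lingual retriever: rank documents by keyword overlap with the
    query, optionally boosting documents in the query's language. The boost
    models the query-language preference the paper measures."""
    q_terms = set(query["text"].lower().split())

    def score(doc):
        overlap = len(q_terms & set(doc["text"].lower().split()))
        return overlap + (same_lang_boost if doc["lang"] == query["lang"] else 0.0)

    return sorted(docs, key=score, reverse=True)[:k]

def build_prompt(query, retrieved):
    """Assemble a RAG prompt, tagging each context passage with its language."""
    context = "\n".join(f"[{d['lang']}] {d['text']}" for d in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query['text']}\nAnswer:"

# Usage: an English query over an English/German document pool.
docs = [
    {"lang": "en", "text": "the eiffel tower is in paris"},
    {"lang": "de", "text": "der eiffelturm steht in paris"},
]
query = {"lang": "en", "text": "where is the eiffel tower"}
top = retrieve(query, docs, k=1, same_lang_boost=1.0)
print(build_prompt(query, top))
```

A real pipeline would replace the overlap scorer with a multilingual dense retriever; the point here is that language preference can enter at both the retrieval ranking and the prompt-assembly stage.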
🔎 Similar Papers
2024-06-20 · International Conference on Computational Linguistics · Citations: 2