🤖 AI Summary
This study identifies a significant cross-lingual factual alignment bias in multilingual large language models (LLMs) for medical question answering: mainstream models predominantly rely on English knowledge sources, leading to substantial degradation in factual consistency and knowledge coverage for non-English queries. To address this, we introduce MultiWikiHealthCare—the first multilingual medical evaluation benchmark covering English, German, Turkish, Chinese, and Italian—and propose a retrieval-augmented generation (RAG) framework with target-language context injection, grounded in Wikidata. Our systematic evaluation measures factual alignment between model outputs and multilingual reference sources. Empirically, we demonstrate that injecting target-language contextual knowledge significantly improves factual alignment for non-English languages (average +23.6%), mitigating culture-specific knowledge gaps and English-centric biases. This work establishes a new paradigm for equitable and reliable multilingual medical AI.
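The evaluation described above compares model outputs against reference texts from several Wikipedia language editions. As a hedged illustration (not the paper's actual metric), the sketch below uses a simple lexical-overlap proxy to decide which language edition a model answer aligns with most closely; the function names and example texts are hypothetical.

```python
# Hedged illustration: a lexical-overlap proxy for "factual alignment"
# between a model answer and per-language reference texts.
# This is NOT the paper's metric, only a minimal stand-in.

def overlap_score(answer: str, reference: str) -> float:
    """Fraction of the answer's tokens that also appear in the reference."""
    a = set(answer.lower().split())
    r = set(reference.lower().split())
    return len(a & r) / len(a) if a else 0.0

def closest_reference(answer: str, references: dict) -> str:
    """Return the language code whose reference best matches the answer."""
    return max(references, key=lambda lang: overlap_score(answer, references[lang]))

# Hypothetical reference excerpts keyed by language code.
refs = {
    "en": "Migraine is a neurological disorder causing recurrent headaches.",
    "de": "Migräne ist eine neurologische Erkrankung mit wiederkehrenden Kopfschmerzen.",
}
answer = "Migraine is a disorder causing recurrent headaches."
# closest_reference(answer, refs) → "en"
```

An English-leaning answer scores highest against the English reference even when the query was posed in another language, which is the alignment bias the study quantifies.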
📝 Abstract
Equitable access to reliable health information is vital when integrating AI into healthcare. Yet information quality varies across languages, raising concerns about the reliability and consistency of multilingual Large Language Models (LLMs). We systematically examine cross-lingual disparities in pre-training sources and in the factual alignment of LLM answers for multilingual healthcare Q&A across English, German, Turkish, Chinese (Mandarin), and Italian. We (i) constructed Multilingual Wiki Health Care (MultiWikiHealthCare), a multilingual healthcare dataset drawn from Wikipedia; (ii) analyzed cross-lingual healthcare coverage; (iii) assessed LLM response alignment with these references; and (iv) conducted a case study on factual alignment using contextual information and Retrieval-Augmented Generation (RAG). Our findings reveal substantial cross-lingual disparities in both Wikipedia coverage and LLM factual alignment. Across LLMs, responses align more closely with English Wikipedia, even when the prompts are non-English. Providing contextual excerpts from non-English Wikipedia at inference time effectively shifts factual alignment toward culturally relevant knowledge. These results highlight practical pathways for building more equitable multilingual AI systems for healthcare.
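The intervention in the case study (providing non-English Wikipedia excerpts at inference time) amounts to prepending a target-language passage to the prompt before querying the model. A minimal sketch, assuming a hypothetical prompt-builder; the template wording is illustrative, not the paper's exact prompt:

```python
# Minimal sketch of target-language context injection for RAG.
# build_rag_prompt and the template text are hypothetical illustrations.

def build_rag_prompt(question: str, excerpt: str, lang: str) -> str:
    """Prepend a retrieved target-language Wikipedia excerpt to the question."""
    return (
        f"Context ({lang} Wikipedia excerpt):\n{excerpt}\n\n"
        f"Using the context above, answer in {lang}.\n"
        f"Question: {question}"
    )

# Example: a German health question paired with a German-language excerpt.
excerpt = "Migräne ist eine neurologische Erkrankung mit anfallsartigen Kopfschmerzen."
prompt = build_rag_prompt("Was sind typische Symptome einer Migräne?", excerpt, "German")
```

The resulting prompt grounds the model in the target-language reference rather than leaving it to default to English-centric pre-training knowledge.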