🤖 AI Summary
This study systematically examines two biases in multilingual large language models (MLLMs) for Previously Fact-Checked Claim Detection (PFCD): *language bias*, meaning better performance on high-resource than on low-resource languages, and *retrieval bias*, meaning over-retrieval of frequently retrieved claims and under-retrieval of rarely retrieved ones. Using the AMC-16K dataset, the study evaluates six open-source MLLMs and multilingual embedding models across 20 languages, using a fully multilingual prompting strategy to analyze cross-lingual performance and retrieval frequency together. It formally defines and quantifies retrieval bias for the first time and reveals its interaction with language bias; it further shows that model family, parameter scale, and prompt design significantly moderate fairness. The work introduces an interpretable evaluation framework and practical recommendations for fairness in multilingual fact-checking, providing theoretical grounding and empirical evidence toward more robust and equitable global fact-checking systems.
📝 Abstract
Multilingual Large Language Models (LLMs) offer powerful capabilities for cross-lingual fact-checking. However, these models often exhibit language bias, performing disproportionately better on high-resource languages such as English than on low-resource counterparts. We also introduce and examine a novel concept, retrieval bias, which arises when information retrieval systems favor certain information over others, leaving the retrieval process skewed. In this paper, we study language and retrieval bias in the context of Previously Fact-Checked Claim Detection (PFCD). Using the AMC-16K dataset, we evaluate six open-source multilingual LLMs across 20 languages with a fully multilingual prompting strategy. By translating task prompts into each language, we uncover disparities in monolingual and cross-lingual performance and identify key trends based on model family, size, and prompting strategy. Our findings highlight persistent bias in LLM behavior and offer recommendations for improving equity in multilingual fact-checking. To investigate retrieval bias, we employ multilingual embedding models and examine the frequency of retrieved claims. Our analysis reveals that certain claims are retrieved disproportionately often across different posts, leading to inflated retrieval performance for popular claims while under-representing less common ones.
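The frequency analysis of retrieved claims described above can be illustrated with a minimal sketch. It is not the paper's metric: the abstract does not give the formal definition of retrieval bias, so the function names, the toy claim IDs, and the "share taken by the top-n claims" statistic below are all assumptions chosen for illustration. The input is assumed to be the top-k claim IDs returned for each post by some embedding-based retriever.

```python
from collections import Counter


def claim_retrieval_counts(top_k_per_post):
    """Count, for each claim ID, how many posts retrieve it in their top-k list."""
    counts = Counter()
    for claim_ids in top_k_per_post:
        # A claim is counted at most once per post.
        counts.update(set(claim_ids))
    return counts


def top_share(counts, n):
    """Fraction of all counted retrievals taken by the n most retrieved claims.

    Hypothetical skew statistic, not the paper's definition of retrieval bias.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(n)) / total


# Toy data (made up): top-3 retrieved claim IDs for four posts.
top_k_per_post = [
    ["c1", "c2", "c3"],
    ["c1", "c4", "c2"],
    ["c1", "c5", "c6"],
    ["c1", "c2", "c7"],
]

counts = claim_retrieval_counts(top_k_per_post)
print(counts.most_common(3))
print(f"share of retrievals taken by top-2 claims: {top_share(counts, 2):.2f}")
```

In this toy input, claim `c1` appears in every post's top-3 list, so a small number of claims accounts for most retrievals; a value of `top_share` far above `n` divided by the number of distinct claims would be one rough signal of the skew the abstract describes.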