🤖 AI Summary
This study identifies and quantifies a significant geographical performance gap in large language models (LLMs) for fact-checking: models achieve, on average, 18.7% higher accuracy in Global North regions than in the Global South. We systematically evaluate GPT-4, Claude Sonnet, and LLaMA across six geopolitical regions, comparing three paradigms—vanilla generation, Wikipedia-based proxy verification, and retrieval-augmented generation (RAG). Contrary to expectations, we empirically demonstrate that the Wikipedia proxy approach exacerbates the North–South disparity, revealing critical coverage gaps in general-purpose knowledge bases for region-specific factual claims. Based on a balanced, cross-regional dataset of 600 manually annotated statements, we attribute the fairness bottleneck to biased retrieval strategies and insufficient representativeness of regionally diverse facts in training data. Our findings provide foundational empirical evidence and methodological insights for developing globally inclusive fact-checking systems.
📝 Abstract
Fact-checking is a potentially useful application of Large Language Models (LLMs) to combat the growing dissemination of disinformation. However, LLM performance varies across geographic regions. In this paper, we evaluate the factual accuracy of open and proprietary models across a diverse set of regions and scenarios. Using a dataset of 600 fact-checked statements balanced across six global regions, we examine three experimental setups for fact-checking a statement: (1) when only the statement is available, (2) when an LLM-based agent with Wikipedia access is used, and (3) as a best-case scenario, when a Retrieval-Augmented Generation (RAG) system supplied with the official fact check is employed. Our findings reveal that regardless of the scenario and LLM used, including GPT-4, Claude Sonnet, and LLaMA, models perform substantially better on statements from the Global North than on those from the Global South. Furthermore, this gap widens in the more realistic case of a Wikipedia agent-based system, highlighting that overly general knowledge bases have limited ability to address region-specific nuances. These results underscore the urgent need for better dataset balancing and robust retrieval strategies to enhance LLM fact-checking capabilities, particularly in geographically diverse contexts.
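The three setups described above can be illustrated as prompt-construction strategies. The sketch below is our own minimal rendering, not the paper's exact protocol: `call_llm` would stand in for any chat-completion API (GPT-4, Claude Sonnet, LLaMA), and the prompt wording is an assumption made for illustration.

```python
# Minimal sketch of the three evaluation setups, assuming a generic
# chat-completion backend. Function names and prompt wording are
# hypothetical; only the three-setup structure comes from the paper.

def vanilla_prompt(statement: str) -> str:
    """Setup 1: the model sees only the statement itself."""
    return (
        "Decide whether the following statement is TRUE or FALSE.\n"
        f"Statement: {statement}\n"
        "Verdict:"
    )

def wikipedia_agent_prompt(statement: str, wiki_passages: list[str]) -> str:
    """Setup 2: an agent first retrieves Wikipedia passages,
    which are then placed in the model's context."""
    context = "\n".join(f"- {p}" for p in wiki_passages)
    return (
        "Using the Wikipedia excerpts below, decide whether the "
        "statement is TRUE or FALSE.\n"
        f"Excerpts:\n{context}\n"
        f"Statement: {statement}\n"
        "Verdict:"
    )

def rag_prompt(statement: str, official_fact_check: str) -> str:
    """Setup 3 (best case): the RAG system retrieves the official
    fact check for the statement and supplies it directly."""
    return (
        "Given this official fact check, decide whether the statement "
        "is TRUE or FALSE.\n"
        f"Fact check: {official_fact_check}\n"
        f"Statement: {statement}\n"
        "Verdict:"
    )
```

Comparing verdict accuracy across these three prompt families, per region, is what surfaces the North–South gap: setup 2 depends on Wikipedia's regional coverage, which is where the disparity widens.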