🤖 AI Summary
Large language models (LLMs) exhibit insufficient reliability in multilingual, multi-domain fact-checking tasks. Method: We construct a dynamic, extensible multilingual fact-checking dataset comprising 61,514 claims and systematically evaluate five state-of-the-art models (GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, Mixtral 8x7B, and Qwen2) across cross-lingual and cross-domain settings. We introduce a novel evaluation paradigm integrating timeliness, linguistic diversity, and topical breadth, coupled with multilingual prompt engineering and fine-grained error attribution analysis. Contribution/Results: We identify, for the first time, that LLMs misclassify factual statements at significantly higher rates than opinion statements. GPT-4o achieves the highest accuracy but exhibits a 43% refusal rate. All models demonstrate systematic deficiencies in factual verification, exposing fundamental limitations in their underlying factual reasoning capabilities.
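The paper itself does not include code here; as a rough illustration of the prompt-based classification setup described above, the sketch below asks a chat model for a one-word verdict on a claim. The prompt wording, label set, and the REFUSE fallback are assumptions for illustration, not the authors' exact protocol.

```python
# Minimal sketch of prompt-based claim classification.
# Prompt template and label set are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = {"TRUE", "FALSE", "OPINION", "REFUSE"}

PROMPT = (
    "You are a fact-checker. Classify the following claim as "
    "TRUE, FALSE, or OPINION. Answer with one word only.\n\nClaim: {claim}"
)

def classify(claim: str, model: str = "gpt-4o") -> str:
    """Ask the model for a one-word verdict; map anything else to REFUSE."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(claim=claim)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict if verdict in LABELS else "REFUSE"
```

The REFUSE fallback matters for this evaluation: any response that is not a clean label counts as a declined classification, which is how a refusal rate like GPT-4o's 43% could be tallied.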
📝 Abstract
The proliferation of misinformation necessitates scalable, automated fact-checking solutions. Yet current benchmarks often overlook multilingual and topical diversity. This paper introduces a novel, dynamically extensible dataset of 61,514 claims spanning multiple languages and topics, extending existing datasets up to 2024. Through a comprehensive evaluation of five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B, we identify significant performance gaps across languages and topics. While GPT-4o achieves the highest overall accuracy, it declines to classify 43% of claims. Across all models, factual-sounding claims are misclassified more often than opinions, revealing a key vulnerability. These findings underscore the need for caution and highlight the challenges of deploying LLM-based fact-checking systems at scale.
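The per-language and per-topic gaps reported in the abstract imply an aggregation step over model verdicts. A minimal sketch of that bookkeeping is below; the record format and the choice to exclude refusals from the accuracy denominator are assumptions, not the paper's stated metric definitions.

```python
# Illustrative aggregation of per-language accuracy and refusal rate.
# Records are assumed to be (language, gold_label, predicted_label) tuples.
from collections import defaultdict

def summarize(records):
    """Print accuracy (over answered claims) and refusal rate per language."""
    stats = defaultdict(lambda: {"total": 0, "refused": 0, "correct": 0})
    for lang, gold, pred in records:
        s = stats[lang]
        s["total"] += 1
        if pred == "REFUSE":
            s["refused"] += 1
        elif pred == gold:
            s["correct"] += 1
    for lang, s in sorted(stats.items()):
        answered = s["total"] - s["refused"]
        acc = s["correct"] / answered if answered else float("nan")
        refusal = s["refused"] / s["total"]
        print(f"{lang}: accuracy={acc:.2%} refusal={refusal:.2%}")
```

Keeping refusal rate separate from accuracy is what makes the headline result legible: a model can rank first on accuracy over the claims it answers while still declining a large share of them.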