Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability

๐Ÿ“… 2025-06-04
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Large language models (LLMs) exhibit insufficient reliability in multilingual, multi-domain fact-checking tasks. Method: We construct a dynamic, extensible multilingual fact-checking dataset comprising 61,514 claims and systematically evaluate five state-of-the-art modelsโ€”GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, Mixtral 8x7B, and Qwen2โ€”across cross-lingual and cross-domain settings. We introduce a novel evaluation paradigm integrating timeliness, linguistic diversity, and topical breadth, coupled with multilingual prompt engineering and fine-grained error attribution analysis. Contribution/Results: We identify, for the first time, that LLMs misclassify factual statements at significantly higher rates than opinion statements. GPT-4o achieves the highest accuracy but exhibits a 43% refusal rate. All models demonstrate systematic deficiencies in factual verification, exposing fundamental limitations in their underlying factual reasoning capabilities.

Technology Category

Application Category

๐Ÿ“ Abstract
The proliferation of misinformation necessitates scalable, automated fact-checking solutions. Yet, current benchmarks often overlook multilingual and topical diversity. This paper introduces a novel, dynamically extensible data set that includes 61,514 claims in multiple languages and topics, extending existing datasets up to 2024. Through a comprehensive evaluation of five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B, we identify significant performance gaps between different languages and topics. While overall GPT-4o achieves the highest accuracy, it declines to classify 43% of claims. Across all models, factual-sounding claims are misclassified more often than opinions, revealing a key vulnerability. These findings underscore the need for caution and highlight challenges in deploying LLM-based fact-checking systems at scale.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM reliability in multilingual fact-checking tasks
Addressing performance gaps across languages and topics in LLMs
Identifying vulnerability in classifying factual claims versus opinions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamically extensible multilingual dataset for fact-checking
Comprehensive evaluation of five prominent LLMs
Identifies performance gaps across languages and topics
๐Ÿ”Ž Similar Papers
No similar papers found.