🤖 AI Summary
Large language models (LLMs) exhibit insufficient reliability in multilingual, multi-domain fact-checking tasks. Method: We construct a dynamic, extensible multilingual fact-checking dataset comprising 61,514 claims and systematically evaluate five state-of-the-art models (GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, Mixtral 8x7B, and Qwen2) across cross-lingual and cross-domain settings. We introduce a novel evaluation paradigm integrating timeliness, linguistic diversity, and topical breadth, coupled with multilingual prompt engineering and fine-grained error attribution analysis. Contribution/Results: We identify, for the first time, that LLMs misclassify factual statements at significantly higher rates than opinion statements. GPT-4o achieves the highest accuracy but exhibits a 43% refusal rate. All models demonstrate systematic deficiencies in factual verification, exposing fundamental limitations in their underlying factual reasoning capabilities.
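The paper itself does not include code here; as a rough illustration of the prompt-based classification setup described above, the sketch below asks a chat model for a one-word verdict on a claim. The prompt wording, label set, and the REFUSE fallback are assumptions for illustration, not the authors' exact protocol.

```python
# Minimal sketch of prompt-based claim classification.
# Prompt template and label set are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = {"TRUE", "FALSE", "OPINION", "REFUSE"}

PROMPT = (
    "You are a fact-checker. Classify the following claim as "
    "TRUE, FALSE, or OPINION. Answer with one word only.\n\nClaim: {claim}"
)

def classify(claim: str, model: str = "gpt-4o") -> str:
    """Ask the model for a one-word verdict; map anything else to REFUSE."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(claim=claim)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict if verdict in LABELS else "REFUSE"
```

The REFUSE fallback matters for this evaluation: any response that is not a clean label counts as a declined classification, which is how a refusal rate like GPT-4o's 43% could be tallied.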
📝 Abstract
The proliferation of misinformation necessitates scalable, automated fact-checking solutions. Yet current benchmarks often overlook multilingual and topical diversity. This paper introduces a novel, dynamically extensible dataset of 61,514 claims spanning multiple languages and topics, extending existing datasets up to 2024. Through a comprehensive evaluation of five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B, we identify significant performance gaps across languages and topics. While GPT-4o achieves the highest overall accuracy, it declines to classify 43% of claims. Across all models, factual-sounding claims are misclassified more often than opinions, revealing a key vulnerability. These findings underscore the need for caution and highlight the challenges of deploying LLM-based fact-checking systems at scale.
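The per-language and per-topic gaps reported in the abstract imply an aggregation step over model verdicts. A minimal sketch of that bookkeeping is below; the record format and the choice to exclude refusals from the accuracy denominator are assumptions, not the paper's stated metric definitions.

```python
# Illustrative aggregation of per-language accuracy and refusal rate.
# Records are assumed to be (language, gold_label, predicted_label) tuples.
from collections import defaultdict

def summarize(records):
    """Print accuracy (over answered claims) and refusal rate per language."""
    stats = defaultdict(lambda: {"total": 0, "refused": 0, "correct": 0})
    for lang, gold, pred in records:
        s = stats[lang]
        s["total"] += 1
        if pred == "REFUSE":
            s["refused"] += 1
        elif pred == gold:
            s["correct"] += 1
    for lang, s in sorted(stats.items()):
        answered = s["total"] - s["refused"]
        acc = s["correct"] / answered if answered else float("nan")
        refusal = s["refused"] / s["total"]
        print(f"{lang}: accuracy={acc:.2%} refusal={refusal:.2%}")
```

Keeping refusal rate separate from accuracy is what makes the headline result legible: a model can rank first on accuracy over the claims it answers while still declining a large share of them.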