Beyond Translation: LLM-Based Data Generation for Multilingual Fact-Checking

📅 2025-02-21
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Existing fact-checking research is heavily skewed toward English, with inadequate support for other languages, especially low-resource ones. To address this, we propose MultiSynFact, the first large-scale multilingual fact-checking dataset, comprising 2.2 million claim-source pairs across English, Spanish, German, and multiple low-resource languages. Our generation pipeline combines large language models with external knowledge from Wikipedia and applies multi-step claim validation, and we release it as an open-source, reusable multilingual data generation framework. Using multilingual prompt engineering, knowledge-augmented retrieval, and rigorous data filtering, our approach improves model performance by an average of 12.7% in F1 score across four languages. The dataset and code are publicly released, providing infrastructure and a scalable methodology for fact-checking in low-resource languages.

📝 Abstract
Robust automatic fact-checking systems have the potential to combat online misinformation at scale. However, most existing research primarily focuses on English. In this paper, we introduce MultiSynFact, the first large-scale multilingual fact-checking dataset containing 2.2M claim-source pairs designed to support Spanish, German, English, and other low-resource languages. Our dataset generation pipeline leverages Large Language Models (LLMs), integrating external knowledge from Wikipedia and incorporating rigorous claim validation steps to ensure data quality. We evaluate the effectiveness of MultiSynFact across multiple models and experimental settings. Additionally, we open-source a user-friendly framework to facilitate further research in multilingual fact-checking and dataset generation.

Problem

Research questions and friction points this paper is trying to address.

Multilingual fact-checking dataset creation
Leverage LLMs for data generation
Support for low-resource languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs for dataset generation
Multilingual claim-source pairs
Open-source framework integration
Yi-Ling Chung
Multiverse Computing
NLP, LLM, model evaluation, data, online safety
Aurora Cobo
Genaios Safe AI
Pablo Serna
Genaios Safe AI