Multilingual, Multimodal Pipeline for Creating Authentic and Structured Fact-Checked Claim Dataset

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses critical limitations in existing fact-checking datasets—namely, insufficient multilingual coverage, inadequate multimodal evidence integration, lack of structured annotations, and coarse-grained claim-evidence alignment—which hinder research on interpretable and cross-lingual misinformation detection. To overcome these challenges, the authors propose an end-to-end pipeline that aggregates ClaimReview sources and retrieves full debunking articles, standardizes heterogeneous verdicts, and fuses structured metadata with aligned visual content to construct the first French and German multimodal fact-checking datasets. The work introduces a novel fine-grained evidence categorization scheme coupled with a verdict linkage mechanism, leveraging large language models and multimodal large models to automatically extract evidence and generate explanatory rationales. Evaluations via G-Eval and human assessment confirm that the dataset effectively supports the development of interpretable, evidence-based fact-checking models, establishing a foundation for multilingual, multimodal misinformation research.

📝 Abstract
The rapid proliferation of misinformation across online platforms underscores the urgent need for robust, up-to-date, explainable, and multilingual fact-checking resources. However, existing datasets are limited in scope, often lacking multimodal evidence, structured annotations, and detailed links between claims, evidence, and verdicts. This paper introduces a comprehensive data collection and processing pipeline that constructs multimodal fact-checking datasets in French and German by aggregating ClaimReview feeds, scraping full debunking articles, normalizing heterogeneous claim verdicts, and enriching them with structured metadata and aligned visual content. We use state-of-the-art large language models (LLMs) and multimodal LLMs for (i) evidence extraction under predefined evidence categories and (ii) justification generation that links evidence to verdicts. Evaluation with G-Eval and human assessment demonstrates that our pipeline enables fine-grained comparison of fact-checking practices across different organizations or media markets, facilitates the development of more interpretable and evidence-grounded fact-checking models, and lays the groundwork for future research on multilingual, multimodal misinformation verification.
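One pipeline stage the abstract describes, normalizing heterogeneous claim verdicts, can be sketched as follows. ClaimReview is the real schema.org type the paper aggregates, but the `VERDICT_MAP` label set and the choice of reading the `reviewRating.alternateName` field are illustrative assumptions, not the authors' actual normalization scheme.

```python
# Minimal sketch: map free-text ClaimReview ratings to a small label set.
# The mapping below is a hypothetical example, not the paper's scheme.
VERDICT_MAP = {
    "false": "false", "faux": "false", "falsch": "false",
    "true": "true", "vrai": "true", "richtig": "true",
    "misleading": "mixed", "trompeur": "mixed", "irreführend": "mixed",
}

def normalize_verdict(claim_review: dict) -> str:
    """Map a ClaimReview item's textual rating to a normalized label."""
    rating = claim_review.get("reviewRating", {})
    raw = (rating.get("alternateName") or "").strip().lower()
    return VERDICT_MAP.get(raw, "other")

item = {
    "@type": "ClaimReview",
    "claimReviewed": "Example claim",
    "reviewRating": {"@type": "Rating", "alternateName": "Faux"},
}
print(normalize_verdict(item))  # -> false
```

In practice, each fact-checking organization uses its own rating vocabulary, so a table like this (or a model-based mapping) is what makes cross-organization comparison possible.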
Problem

Research questions and friction points this paper is trying to address.

fact-checking
multilingual
multimodal
misinformation
structured dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual fact-checking
multimodal pipeline
structured claim dataset
evidence extraction
justification generation
Z. M. Husunbeyi
Ruhr-Universität Bochum
Virginie Mouilleron
Inria Paris
Leonie Uhling
Ruhr-Universität Bochum
Daniel Foppe
Ruhr-Universität Bochum
Tatjana Scheffler
Ruhr-Universität Bochum, Germany
Computational Linguistics, Discourse, Pragmatics, Semantics
Djamé Seddah
Inria (Almanach)
LLMs, data set development, low-resource languages, Arabic dialects, UGC