HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of identifying checkworthy claims that are multimodal, cross-domain, and potentially synthetic. The authors propose a scalable checkworthiness-detection paradigm and introduce HintsOfTruth, a large-scale (27K instances) multimodal checkworthiness detection dataset of real and synthetic image/claim pairs, enabling automated identification of check-worthy claims and reducing the manual effort and delayed response of fact-checking. Methodologically, the work jointly models authentic and synthetic multimodal claims to support cross-domain and adversarial robustness evaluation, and systematically benchmarks fine-tuned and prompted LLMs, multimodal large language models (MLLMs), and lightweight text encoders. Results show that lightweight encoders match MLLMs at filtering non-claim-like content, while MLLMs are more robust on synthetic data at substantially higher computational cost. The work contributes foundational data, empirical validation, and practical trade-off insights for large-scale automated fact-checking.

📝 Abstract
Misinformation can be countered with fact-checking, but the process is costly and slow. Identifying checkworthy claims is the first step, where automation can help scale fact-checkers' efforts. However, detection methods struggle with content that is 1) multimodal, 2) from diverse domains, and 3) synthetic. We introduce HintsOfTruth, a public dataset for multimodal checkworthiness detection with 27K real-world and synthetic image/claim pairs. The mix of real and synthetic data makes this dataset unique and ideal for benchmarking detection methods. We compare fine-tuned and prompted Large Language Models (LLMs). We find that well-configured lightweight text-based encoders perform comparably to multimodal models, but the former mainly focus on identifying non-claim-like content. Multimodal LLMs can be more accurate but come at a significant computational cost, making them impractical for large-scale applications. When faced with synthetic data, multimodal models perform more robustly.
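The abstract frames checkworthiness detection as a binary classification step: given a claim (optionally paired with an image), decide whether it is worth sending to a fact-checker. A minimal text-only sketch of that framing is below. It uses a TF-IDF + logistic regression pipeline purely as a lightweight stand-in; the paper's "lightweight text-based encoders" are fine-tuned transformer models, and the example sentences and labels here are invented for illustration, not drawn from HintsOfTruth.

```python
# Hypothetical sketch: checkworthiness detection as binary text
# classification. TF-IDF + logistic regression stands in for the
# lightweight fine-tuned text encoders benchmarked in the paper;
# the training sentences below are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Label 1 = checkworthy factual claim, 0 = non-claim-like content.
train_texts = [
    "The city reduced emissions by 40 percent in 2020.",  # 1
    "Two million people received the vaccine last month.",  # 1
    "What a beautiful sunset over the bridge!",             # 0
    "Good morning everyone, have a great day.",             # 0
]
train_labels = [1, 1, 0, 0]

# Vectorize the text and fit a linear classifier in one pipeline.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

# Score an unseen sentence; output is a 0/1 checkworthiness label.
pred = clf.predict(["The unemployment rate fell to 3 percent."])
print(pred[0])
```

The same interface extends naturally to the multimodal setting the paper studies: an MLLM would consume the image/claim pair jointly, whereas a text-only filter like this one only sees the claim text, which is why it is cheap but limited to spotting non-claim-like content.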
Problem

Research questions and friction points this paper is trying to address.

Detecting checkworthy multimodal claims
Addressing diverse and synthetic content
Evaluating computational cost of detection methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal dataset HintsOfTruth
Fine-tuned and prompted LLMs
Lightweight text-based encoders