AI Summary
This work addresses the challenge of multimodal misinformation, where misleading textual claims are often amplified through image manipulation or reuse. To tackle this, the authors introduce RW-Post, the first multimodal fact-checking benchmark featuring auditable annotations, comprising social media posts with aligned image-text pairs and explicitly linked evidence items. Leveraging a large language model-assisted pipeline for evidence extraction and auditing, the benchmark supports three evaluation settings: closed-book, evidence-bounded, and open-web. The study also establishes AgentFact as a baseline verifier. Experimental results reveal that current open-source large vision-language models (LVLMs) exhibit significant deficiencies in faithful, evidence-grounded reasoning, while evidence-bounded evaluation substantially improves both accuracy and reasoning fidelity.
Abstract
Multimodal misinformation increasingly leverages visual persuasion, where repurposed or manipulated images strengthen misleading text. We introduce RW-Post, a post-aligned text-image benchmark for real-world multimodal fact-checking with auditable annotations: each instance links the original social-media post with reasoning traces and explicitly linked evidence items derived from human fact-check articles via an LLM-assisted extraction-and-auditing pipeline. RW-Post supports controlled evaluation across closed-book, evidence-bounded, and open-web regimes, enabling systematic diagnosis of visual grounding and evidence utilization. We provide AgentFact as a reference verification baseline and benchmark strong open-source LVLMs under unified protocols. Experiments show substantial headroom: current models struggle with faithful evidence grounding, while evidence-bounded evaluation improves both accuracy and faithfulness. Code and dataset will be released at https://github.com/xudanni0927/AgentFact.
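
As a rough illustration of how the three evaluation regimes might differ in what a verifier is allowed to see, here is a minimal Python sketch. It is not the released RW-Post schema or the AgentFact code: the class names, field names, the `build_verifier_input` helper, and the choice to withhold annotated evidence in the open-web regime are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Regime(Enum):
    """The three regimes named in the abstract."""
    CLOSED_BOOK = "closed-book"
    EVIDENCE_BOUNDED = "evidence-bounded"
    OPEN_WEB = "open-web"


@dataclass
class EvidenceItem:
    # Hypothetical fields; the real dataset may store evidence differently.
    evidence_id: str
    text: str


@dataclass
class PostInstance:
    # Hypothetical layout of one benchmark instance.
    post_text: str
    image_path: str
    label: str
    reasoning_trace: List[str] = field(default_factory=list)
    evidence: List[EvidenceItem] = field(default_factory=list)


def build_verifier_input(instance: PostInstance, regime: Regime) -> Dict[str, object]:
    """Assemble what a verifier sees under each regime (assumed behavior)."""
    inputs: Dict[str, object] = {
        "text": instance.post_text,
        "image": instance.image_path,
        "evidence": [],
        "allow_search": False,
    }
    if regime is Regime.EVIDENCE_BOUNDED:
        # Only the annotated, linked evidence items are exposed.
        inputs["evidence"] = [item.text for item in instance.evidence]
    elif regime is Regime.OPEN_WEB:
        # Assumption: the verifier retrieves its own evidence from the web.
        inputs["allow_search"] = True
    # CLOSED_BOOK keeps the defaults: no evidence, no retrieval.
    return inputs
```

Under these assumptions, the gold label and reasoning trace never enter the verifier input; they would be used only for scoring accuracy and faithfulness afterwards.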