🤖 AI Summary
Multimodal disinformation, such as manipulated or repurposed images paired with misleading text, is hard for conventional fact-checking on social media to handle because the image and the text can be semantically inconsistent with each other. Method: The authors propose a retrieval-augmented, zero-shot verification framework for multimodal misinformation. It builds a claim-driven graph of entities and relations, integrates CLIP-based visual features with knowledge graphs, and retrieves trustworthy external evidence in real time to detect image-text semantic misalignment. The framework supports fine-grained, interpretable, element-wise verification, explicitly labeling which image and text segments are credible and which are suspicious. Contribution/Results: The work presents a novel zero-shot multimodal verification approach that performs competitively with state-of-the-art methods while producing transparent, traceable verification reports with explicit evidence grounding.
📝 Abstract
The rise of disinformation on social media, especially through the strategic manipulation or repurposing of images, paired with provocative text, presents a complex challenge for traditional fact-checking methods. In this paper, we introduce a novel zero-shot approach to identify and interpret such multimodal disinformation, leveraging real-time evidence from credible sources. Our framework goes beyond simple true-or-false classifications by analyzing both the textual and visual components of social media claims in a structured, interpretable manner. By constructing a graph-based representation of entities and relationships within the claim, combined with pretrained visual features, our system automatically retrieves and matches external evidence to identify inconsistencies. Unlike traditional models dependent on labeled datasets, our method empowers users with transparency, illuminating exactly which aspects of the claim hold up to scrutiny and which do not. Our framework achieves competitive performance with state-of-the-art methods while offering enhanced explainability.
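To make the idea of element-wise verification concrete, here is a toy sketch. It assumes the claim and the retrieved evidence have already been reduced to (subject, relation, object) triples, and it leaves out both the retrieval step and the pretrained visual features. The `Triple` class, the `verify_claim` function, the three labels, and the sample data are all hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One edge of a claim or evidence graph (hypothetical schema)."""
    subject: str
    relation: str
    obj: str


def verify_claim(claim_triples, evidence_triples):
    """Label each claim triple as supported, contradicted, or unverified."""
    # Index evidence by (subject, relation) so conflicting objects are easy to spot.
    evidence_index = {}
    for t in evidence_triples:
        evidence_index.setdefault((t.subject, t.relation), set()).add(t.obj)

    report = {}
    for t in claim_triples:
        known_objects = evidence_index.get((t.subject, t.relation))
        if known_objects is None:
            report[t] = "unverified"      # no evidence about this element
        elif t.obj in known_objects:
            report[t] = "supported"       # evidence agrees with the claim
        else:
            report[t] = "contradicted"    # evidence names a different object
    return report


# Made-up example: a real flood photo repurposed with a false location.
claim = [
    Triple("photo", "depicts", "flood"),
    Triple("photo", "taken_in", "Paris"),
    Triple("photo", "taken_on", "2024-01-05"),
]
evidence = [
    Triple("photo", "depicts", "flood"),
    Triple("photo", "taken_in", "Jakarta"),
]

for triple, label in verify_claim(claim, evidence).items():
    print(f"{triple.relation}={triple.obj}: {label}")
```

The per-element labels are what separate this style of output from a single true-or-false verdict: a reader can see that the depicted event holds up while the stated location does not.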