🤖 AI Summary
Widespread multimodal misinformation, in particular authentic images paired with deceptive textual claims, poses significant societal risks, yet existing detection methods lack interpretability and rely heavily on labeled data. Method: This paper proposes a cross-modal consistency detection framework coupled with zero-shot contextualized warning generation, introducing warning generation as a novel zero-shot task and combining cross-modal consistency modeling, a lightweight Transformer architecture, and prompt-driven generative synthesis. The lightweight model achieves competitive detection performance while using only one-third of the parameters of comparable baselines. Contribution/Results: To our knowledge, this is the first approach that enables explainable debunking without annotated warning examples, delivering both high-precision detection and human-interpretable, semantically grounded warnings. Experiments on multiple benchmarks and human evaluations show that the generated warnings can enhance users' discernment of false information, while also revealing limitations of the approach.
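As an illustration only (the paper's exact architecture is not specified in this summary), a cross-modal consistency detector of this kind might fuse embeddings from frozen image and text encoders with a small Transformer head. The module names, dimensions, and layer counts below are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConsistencyDetector(nn.Module):
    """Hypothetical sketch: score whether a caption is consistent with an image.

    Assumes pre-computed image and text embeddings from frozen encoders
    (e.g. CLIP); the paper's actual model may differ substantially.
    """

    def __init__(self, embed_dim=512, num_layers=2, num_heads=8):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        # Lightweight Transformer that fuses the two modalities.
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(embed_dim, 2)  # consistent vs. out-of-context

    def forward(self, image_emb, text_emb):
        # Treat the two modality embeddings as a two-token sequence.
        tokens = torch.stack([image_emb, text_emb], dim=1)  # (B, 2, D)
        fused = self.fusion(tokens)                         # (B, 2, D)
        pooled = fused.mean(dim=1)                          # (B, D)
        return self.classifier(pooled)                      # (B, 2) logits


# Example usage with dummy embeddings:
# detector = ConsistencyDetector()
# logits = detector(torch.randn(4, 512), torch.randn(4, 512))
```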
📝 Abstract
The widespread prevalence of misinformation raises significant societal concerns. Out-of-context misinformation, where authentic images are paired with false text, is particularly deceptive and easily misleads audiences. Most existing detection methods primarily evaluate image-text consistency but often lack sufficient explanations, which are essential for effectively debunking misinformation. We present a model that detects multimodal misinformation through cross-modality consistency checks, requiring minimal training time. Additionally, we propose a lightweight model that achieves competitive performance using only one-third of the parameters. We also introduce a dual-purpose zero-shot learning task for generating contextualized warnings, enabling automated debunking and enhancing user comprehension. Qualitative and human evaluations of the generated warnings highlight both the potential and the limitations of our approach.
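To make the zero-shot, prompt-driven side of the approach concrete, a contextualized warning could be obtained by templating the detected mismatch into an instruction prompt for a generative model. This is a minimal sketch under assumed inputs; the authors' actual prompts, evidence sources, and generator are not given in this abstract.

```python
def build_warning_prompt(claim: str, image_caption: str, evidence: str) -> str:
    """Hypothetical prompt template for zero-shot warning generation.

    Illustrates prompting a generative model to produce a contextualized
    warning without any annotated warning examples; wording is an assumption.
    """
    return (
        "An image with the following visual content was shared online:\n"
        f"  Image content: {image_caption}\n"
        f"  Accompanying claim: {claim}\n"
        f"  Retrieved context: {evidence}\n\n"
        "The claim is inconsistent with the image. Write a short, neutral "
        "warning that explains the mismatch to a reader."
    )


# Example usage with any instruction-tuned generator, e.g. via Hugging Face
# transformers (model choice is an assumption, not the paper's setup):
# from transformers import pipeline
# generator = pipeline("text2text-generation", model="google/flan-t5-base")
# warning = generator(build_warning_prompt(claim, caption, evidence))[0]["generated_text"]
```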