COVE: COntext and VEracity prediction for out-of-context images

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Multimodal misinformation often arises from a disconnect between an image and its textual description: the image's authentic context (e.g., time, location, event) is missing, and the veracity of the caption is hard to check. Method: a decoupled two-stage framework in which Stage I reconstructs the image's original context and Stage II uses that context to verify the caption. Separating context reconstruction from factual verification makes each module interpretable and its output reusable: one reconstructed context can support the verification of multiple captions for the same image. A unified multimodal large language model is jointly fine-tuned for image-text alignment, context generation, and binary veracity classification. Results: the method outperforms state-of-the-art approaches in context prediction accuracy by 5.1% on average, achieves significantly higher caption-veracity classification accuracy on real-world data, and receives strong human-evaluation scores for the interpretability and practical reusability of the predicted context.
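The decoupled two-stage design described above can be sketched as a minimal pipeline. This is an illustrative assumption, not COVE's actual implementation: the function names, the `Context` fields, and the toy keyword check below stand in for the fine-tuned multimodal models the paper uses.

```python
from dataclasses import dataclass

@dataclass
class Context:
    # Context items reconstructed for an image (field names are illustrative)
    date: str
    location: str
    event: str

def predict_context(image_id: str) -> Context:
    # Stage I (hypothetical stub): a fine-tuned multimodal model would
    # reconstruct the image's true context here.
    return Context(date="2019-04-15", location="Paris, France",
                   event="Notre-Dame fire")

def predict_veracity(context: Context, caption: str) -> bool:
    # Stage II (hypothetical stub): check the caption against the predicted
    # context; a toy keyword match stands in for the verification model.
    caption = caption.lower()
    return (context.event.split()[0].lower() in caption
            and context.location.split(",")[0].lower() in caption)

# Reusability: one reconstructed context verifies several candidate captions.
ctx = predict_context("img_001")
print(predict_veracity(ctx, "Notre-Dame cathedral burns in Paris"))  # True
print(predict_veracity(ctx, "Wildfire destroys California forest"))  # False
```

The point of the separation is visible in the last three lines: the (expensive) context reconstruction runs once per image, while the (cheap) veracity check can be repeated for every new out-of-context caption.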

📝 Abstract
Images taken out of their context are the most prevalent form of multimodal misinformation. Debunking them requires (1) providing the true context of the image and (2) checking the veracity of the image's caption. However, existing automated fact-checking methods fail to tackle both objectives explicitly. In this work, we introduce COVE, a new method that predicts first the true COntext of the image and then uses it to predict the VEracity of the caption. COVE beats the SOTA context prediction model on all context items, often by more than five percentage points. It is competitive with the best veracity prediction models on synthetic data and outperforms them on real-world data, showing that it is beneficial to combine the two tasks sequentially. Finally, we conduct a human study that reveals that the predicted context is a reusable and interpretable artifact to verify new out-of-context captions for the same image. Our code and data are made available.
Problem

Research questions and friction points this paper is trying to address.

Detecting out-of-context image misinformation
Predicting true image context automatically
Verifying caption veracity using predicted context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts true context of out-of-context images
Uses predicted context to verify caption veracity
Sequentially combines context prediction and veracity checking