🤖 AI Summary
Multimodal misinformation detection (MMD) suffers from scarce high-quality labeled data, limited realism and diversity in existing synthetic data generation, and underexploited large vision-language models (LVLMs), particularly for miscaptioned images. This paper addresses these gaps with: (1) an LVLM-driven paradigm for generating diverse miscaptioned images, yielding the synthetic training dataset "MisCaption This!"; (2) LAMAR, a latent multimodal reconstruction framework that reconstructs truthful-caption embeddings as a strong auxiliary signal, going beyond purely classification-based training; and (3) an exploration of training strategies (end-to-end training and large-scale pre-training) and integration mechanisms (direct, mask, gate, and attention). The approach sets new state-of-the-art results on NewsCLIPpings and VERITE and generalizes better to real-world misinformation. The code is publicly available.
📝 Abstract
Multimodal misinformation, such as miscaptioned images, where captions misrepresent an image's origin, context, or meaning, poses a growing challenge in the digital age. To support fact-checkers, researchers have been focusing on creating datasets and developing methods for multimodal misinformation detection (MMD). Due to the scarcity of large-scale annotated MMD datasets, recent studies leverage synthetic training data via out-of-context image-caption pairs or named entity manipulations: altering names, dates, and locations. However, these approaches often produce simplistic misinformation that fails to reflect real-world complexity, limiting the robustness of detection models trained on them. Meanwhile, despite recent advancements, Large Vision-Language Models (LVLMs) remain underutilized for generating diverse, realistic synthetic training data for MMD. To address this gap, we introduce "MisCaption This!", a training dataset comprising LVLM-generated miscaptioned images. Additionally, we introduce "Latent Multimodal Reconstruction" (LAMAR), a network trained to reconstruct the embeddings of truthful captions, providing a strong auxiliary signal to the detection process. To optimize LAMAR, we explore different training strategies (end-to-end training and large-scale pre-training) and integration approaches (direct, mask, gate, and attention). Extensive experiments show that models trained on "MisCaption This!" generalize better on real-world misinformation, while LAMAR sets new state-of-the-art performance on both the NewsCLIPpings and VERITE benchmarks, highlighting the potential of LVLM-generated data and reconstruction-based approaches for advancing MMD. We release our code at: https://github.com/stevejpapad/miscaptioned-image-reconstruction
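To make the core idea concrete, here is a minimal, illustrative sketch (not the paper's actual architecture) of reconstruction-as-auxiliary-signal with gate-style integration: a small head predicts the truthful-caption embedding from an image-caption pair, a reconstruction loss supervises it, and a gate mixes the reconstructed embedding into the detection features. All dimensions, the gating formula, and the stand-in embeddings are assumptions made for this sketch; numpy is used in place of a deep-learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # embedding size (illustrative; real CLIP-style embeddings are larger)

def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron with ReLU; stands in for the reconstruction head.
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-computed image and caption embeddings for one pair.
img = rng.standard_normal(d)
cap = rng.standard_normal(d)

# Reconstruction head: predict the truthful-caption embedding from the pair.
w1 = rng.standard_normal((2 * d, d)) * 0.1
b1 = np.zeros(d)
w2 = rng.standard_normal((d, d)) * 0.1
b2 = np.zeros(d)
recon = mlp(np.concatenate([img, cap]), w1, b1, w2, b2)

# Auxiliary loss against the truthful-caption embedding (here a stand-in:
# for a truthful training pair, the input caption itself is the target).
truthful = cap
recon_loss = float(np.mean((recon - truthful) ** 2))

# Gate integration (one of the paper's four options): a scalar sigmoid gate
# controls how much of the reconstructed embedding enters the classifier.
g = sigmoid(np.dot(cap, recon) / np.sqrt(d))
fused = np.concatenate([img, cap, g * recon])

# Detection head: binary truthful-vs-miscaptioned probability.
w_cls = rng.standard_normal(3 * d) * 0.1
p_misinfo = sigmoid(float(fused @ w_cls))

print(fused.shape, recon_loss >= 0.0, 0.0 < p_misinfo < 1.0)
```

In training, the classification loss and `recon_loss` would be optimized jointly; the intuition is that a pair whose latent content cannot be reconstructed into a plausible truthful caption is more likely miscaptioned.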