🤖 AI Summary
This work addresses the detection of multimodal misinformation on social media, which is complicated by semantic inconsistencies across modalities and by the evolution of narratives over time. The authors propose MOMENTA, a unified framework that explicitly models cross-modal inconsistency by combining modality-specific Mixture-of-Experts modules, bidirectional co-attention, and a discrepancy-aware branch. An attention-based temporal aggregation mechanism with drift and momentum encoding captures how disinformation narratives evolve over time. A prototype memory bank and domain-adversarial learning are added to improve generalization across datasets and domains. Experiments on four benchmarks (Fakeddit, MMCoVaR, Weibo, and XFacta) show strong, consistent results in terms of accuracy, F1 score, AUC, and Matthews Correlation Coefficient (MCC).
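Of the four reported metrics, MCC is the least familiar, so a minimal sketch of the standard binary definition may help. The function name and the zero-denominator convention are choices made for this illustration and are not taken from the paper.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary MCC computed from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption for this sketch: return 0.0 when a row or column of the
    # confusion matrix is empty, which is a common convention.
    return numerator / denominator if denominator else 0.0
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction). Because it uses all four cells of the confusion matrix, it is often reported alongside accuracy and F1 on class-imbalanced data.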
📝 Abstract
The widespread dissemination of multimodal content on social media has made misinformation detection increasingly challenging, as misleading narratives often arise not only from textual or visual content, but also from semantic inconsistencies between modalities and their evolution over time. Existing multimodal misinformation detection methods typically model cross-modal interactions statically and often show limited robustness across heterogeneous datasets, domains, and narrative settings. To address these challenges, we propose MOMENTA, a unified framework for multimodal misinformation detection that captures modality heterogeneity, cross-modal inconsistency, temporal dynamics, and cross-domain generalization within a single architecture. MOMENTA employs modality-specific mixture-of-experts modules to model diverse misinformation patterns, bidirectional co-attention to align textual and visual representations in a shared semantic space, and a discrepancy-aware branch to explicitly capture semantic disagreement between modalities. To model narrative evolution, we introduce an attention-based temporal aggregation mechanism with drift and momentum encoding over overlapping time windows, enabling the framework to capture both short-term fluctuations and longer-term trends in misinformation propagation. In addition, domain-adversarial learning and a prototype memory bank improve domain invariance and stabilize representation learning across datasets. The model is trained using a multi-objective optimization strategy that jointly enforces classification performance, cross-modal alignment, contrastive learning, temporal consistency, and domain robustness. Experiments on Fakeddit, MMCoVaR, Weibo, and XFacta show that MOMENTA achieves strong, consistent results across accuracy, F1 score, AUC, and MCC, highlighting its effectiveness for multimodal misinformation detection.
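The abstract does not give formulas for the temporal component, so the following is only an illustrative sketch of how attention-based aggregation with drift and momentum encoding over overlapping windows could look. Everything specific here is an assumption: drift is taken as the difference between consecutive window embeddings, momentum as an exponential moving average of drift with a hypothetical decay `beta`, and attention as a scaled dot product against a hypothetical `query` vector. The function names are invented for this example and are not MOMENTA's actual implementation.

```python
import math


def overlapping_windows(items, size, stride):
    """Split a time-ordered list into windows that overlap when stride < size."""
    last_start = max(len(items) - size, 0)
    return [items[i:i + size] for i in range(0, last_start + 1, stride)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_aggregate(window_embeddings, query, beta=0.9):
    """Attention-pool per-window embeddings augmented with drift and momentum.

    window_embeddings: list of equal-length float vectors, oldest first.
    query: hypothetical attention query of length 3 * embedding_dim.
    Returns (pooled_vector, attention_weights).
    """
    dim = len(window_embeddings[0])
    previous = window_embeddings[0]
    momentum = [0.0] * dim
    features = []
    for current in window_embeddings:
        # Assumed drift: change relative to the previous window (short-term).
        drift = [c - p for c, p in zip(current, previous)]
        # Assumed momentum: moving average of drift (longer-term trend).
        momentum = [beta * m + (1.0 - beta) * d for m, d in zip(momentum, drift)]
        features.append(list(current) + drift + momentum)
        previous = current
    # Scaled dot-product score of each augmented window against the query.
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    weights = softmax(scores)
    pooled = [
        sum(w * feat[k] for w, feat in zip(weights, features))
        for k in range(3 * dim)
    ]
    return pooled, weights
```

In this sketch the drift term responds to abrupt changes between adjacent windows, while the momentum term smooths those changes into a slower-moving signal, which mirrors the short-term versus longer-term distinction drawn in the abstract. In a trained model the query and the window embeddings would be learned rather than supplied by hand.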