Exposing Cross-Modal Consistency for Fake News Detection in Short-Form Videos

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of detecting multimodal fake news in short videos, where individual modalities appear plausible yet exhibit subtle cross-modal inconsistencies. To tackle this, the authors propose MAGIC3, a novel framework that explicitly models multi-level consistency among text, visual, and audio modalities. By integrating a cross-modal attention mechanism to extract fine-grained alignment features, and enhancing robustness through an uncertainty-aware classifier and multi-style large language model rewriting, MAGIC3 achieves high detection accuracy. Furthermore, a selective vision-language model routing strategy enables the system to match the performance of state-of-the-art vision-language models on the FakeSV and FakeTT datasets while improving inference throughput by 18–27× and reducing GPU memory usage by 93%.
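To make the consistency idea concrete, here is a minimal sketch of pairwise and global cross-modal consistency scoring over modality embeddings. It assumes each modality has already been encoded into a fixed-size vector; the cosine measure, the function names, and the mean-based global score are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: pairwise and global cross-modal consistency
# from pre-extracted modality embeddings. Function names and the
# averaging scheme are illustrative, not MAGIC3's exact mechanism.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def consistency_scores(text_emb: np.ndarray,
                       visual_emb: np.ndarray,
                       audio_emb: np.ndarray) -> dict:
    """Pairwise text-visual, text-audio, and visual-audio consistency,
    plus a single global score (here simply their mean)."""
    pairs = {
        "text_visual": cosine(text_emb, visual_emb),
        "text_audio": cosine(text_emb, audio_emb),
        "visual_audio": cosine(visual_emb, audio_emb),
    }
    pairs["global"] = sum(pairs.values()) / 3
    return pairs
```

Under this sketch, the asymmetry the paper reports (real videos: high text-visual, moderate text-audio consistency; fake videos: the opposite) would show up directly in the pairwise entries, while the global score gives the single interpretable axis mentioned in the abstract.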

📝 Abstract
Short-form video platforms are major channels for news but also fertile ground for multimodal misinformation, where each modality appears plausible on its own yet cross-modal relationships are subtly inconsistent, such as mismatched visuals and captions. On two benchmark datasets, FakeSV (Chinese) and FakeTT (English), we observe a clear asymmetry: real videos exhibit high text-visual but moderate text-audio consistency, while fake videos show the opposite pattern. Moreover, a single global consistency score forms an interpretable axis along which fake probability and prediction errors vary smoothly. Motivated by these observations, we present MAGIC3 (Modal-Adversarial Gated Interaction and Consistency-Centric Classifier), a detector that explicitly models and exposes cross-tri-modal consistency signals at multiple granularities. MAGIC3 combines explicit pairwise and global consistency modeling with token- and frame-level consistency signals derived from cross-modal attention, incorporates multi-style LLM rewrites to obtain style-robust text representations, and employs an uncertainty-aware classifier for selective VLM routing. Using pre-extracted features, MAGIC3 consistently outperforms the strongest non-VLM baselines on FakeSV and FakeTT. While matching VLM-level accuracy, the two-stage system achieves 18–27× higher throughput and 93% VRAM savings, offering a strong cost-performance tradeoff.
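The selective VLM routing described above can be sketched as an uncertainty gate: the cheap classifier keeps confident samples, and only uncertain ones are escalated to the expensive VLM. The binary-entropy measure and the threshold value below are illustrative assumptions, not the paper's exact routing rule.

```python
# Hypothetical sketch: uncertainty-aware selective routing between a
# lightweight classifier and a costly VLM. The entropy criterion and
# threshold are illustrative assumptions, not MAGIC3's exact rule.
import math


def predictive_entropy(p_fake: float) -> float:
    """Binary entropy of the classifier's fake-probability (in nats)."""
    p = min(max(p_fake, 1e-9), 1 - 1e-9)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def route(p_fake: float, threshold: float = 0.5):
    """Return ('classifier', label) for confident predictions,
    ('vlm', None) when the entropy exceeds the threshold and the
    sample should be escalated to the vision-language model."""
    if predictive_entropy(p_fake) > threshold:
        return ("vlm", None)
    return ("classifier", "fake" if p_fake >= 0.5 else "real")
```

Because most samples fall on the confident side of such a gate, only a small fraction ever reaches the VLM, which is consistent with the reported 18–27× throughput gain and 93% VRAM savings relative to running the VLM on every video.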
Problem

Research questions and friction points this paper is trying to address.

fake news detection
short-form videos
cross-modal inconsistency
multimodal misinformation
text-visual-audio consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal consistency
multimodal fake news detection
MAGIC3
uncertainty-aware classification
style-robust LLM rewriting
Chong Tian
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Yu Wang
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Chenxu Yang
Institute of Information Engineering, Chinese Academy of Sciences
NLP · Dialogue Generation
Junyi Guan
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Zheng Lin
Institute of Information Engineering, Chinese Academy of Sciences
NLP
Yuhan Liu
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Xiuying Chen
MBZUAI
Trustworthy NLP · Human-Centered NLP · Computational Social Science
Qirong Ho
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) and Petuum, Inc.