DIVER: Dynamic Iterative Visual Evidence Reasoning for Multimodal Fake News Detection

📅 2026-01-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in multimodal fake news detection—namely, computational redundancy in static fusion, hallucinations in large language models, and insufficient visual grounding—by proposing a dynamic, iterative visual evidence reasoning framework. The approach first establishes a reliable textual baseline and adaptively invokes fine-grained visual tools (e.g., OCR and dense image captioning) only when necessary, guided by cross-modal alignment verification. Multimodal fusion and iterative refinement are further enhanced through an uncertainty-aware mechanism. This study pioneers an on-demand, evidence-driven multimodal reasoning paradigm, achieving an average performance gain of 2.72% over state-of-the-art methods across the Weibo, Weibo21, and GossipCop datasets while reducing inference latency to 4.12 seconds, thus balancing accuracy and efficiency.
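The gated, on-demand control flow described above can be sketched as a simple decision loop: trust the textual verdict when it is confident, check cross-modal alignment otherwise, and only invoke fine-grained visual tools when the modalities disagree. The snippet below is a minimal illustration based solely on this summary; the thresholds and helpers (`text_only_verdict`, `alignment_score`, `fine_grained_visual_verdicts`, `fuse`) are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of the on-demand reasoning flow described above.
# All helpers and thresholds are hypothetical placeholders, not DIVER's API.
from dataclasses import dataclass
from typing import List

TEXT_CONF_THRESHOLD = 0.85   # assumed: accept the text-only verdict above this
ALIGN_THRESHOLD = 0.60       # assumed: below this, text and image disagree


@dataclass
class Verdict:
    label: str         # "real" or "fake"
    confidence: float  # in [0, 1]


def text_only_verdict(text: str) -> Verdict:
    """Placeholder for the text-based baseline (language analysis)."""
    suspicious = any(w in text.lower() for w in ("shocking", "unbelievable"))
    return Verdict("fake" if suspicious else "real", 0.60 if suspicious else 0.50)


def alignment_score(text: str, image_caption: str) -> float:
    """Placeholder for inter-modal alignment verification (word overlap here)."""
    t, c = set(text.lower().split()), set(image_caption.lower().split())
    return len(t & c) / max(len(t | c), 1)


def fine_grained_visual_verdicts(image_path: str) -> List[Verdict]:
    """Placeholder for fine-grained tools such as OCR and dense captioning."""
    return [Verdict("fake", 0.55), Verdict("fake", 0.65)]


def fuse(verdicts: List[Verdict]) -> Verdict:
    """Placeholder fusion: confidence-weighted vote over available verdicts."""
    score = sum((1 if v.label == "fake" else -1) * v.confidence for v in verdicts)
    conf = min(0.5 + abs(score) / len(verdicts), 1.0)
    return Verdict("fake" if score > 0 else "real", conf)


def classify(text: str, image_path: str, image_caption: str) -> Verdict:
    # Step 1: establish the text-only baseline; no visual tools invoked yet.
    base = text_only_verdict(text)
    if base.confidence >= TEXT_CONF_THRESHOLD:
        return base
    # Step 2: a coarse cross-modal alignment check gates deeper inspection.
    if alignment_score(text, image_caption) >= ALIGN_THRESHOLD:
        return base  # modalities agree; skip the expensive tools
    # Step 3: significant discrepancy -> invoke OCR / dense captioning and fuse.
    return fuse([base] + fine_grained_visual_verdicts(image_path))


if __name__ == "__main__":
    print(classify("Shocking photo shows event X", "post.jpg", "a cat on a sofa"))
```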

📝 Abstract
Multimodal fake news detection is crucial for mitigating adversarial misinformation. Existing methods, relying on static fusion or LLMs, face computational redundancy and hallucination risks due to weak visual foundations. To address this, we propose DIVER (Dynamic Iterative Visual Evidence Reasoning), a framework grounded in a progressive, evidence-driven reasoning paradigm. DIVER first establishes a strong text-based baseline through language analysis, leveraging intra-modal consistency to filter unreliable or hallucinated claims. Only when textual evidence is insufficient does the framework introduce visual information, where inter-modal alignment verification adaptively determines whether deeper visual inspection is necessary. For samples exhibiting significant cross-modal semantic discrepancies, DIVER selectively invokes fine-grained visual tools (e.g., OCR and dense captioning) to extract task-relevant evidence, which is iteratively aggregated via uncertainty-aware fusion to refine multimodal reasoning. Experiments on Weibo, Weibo21, and GossipCop demonstrate that DIVER outperforms state-of-the-art baselines by an average of 2.72%, while optimizing inference efficiency with a reduced latency of 4.12 s.
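One way to read the abstract's "iteratively aggregated via uncertainty-aware fusion" is as a loop that weights each new piece of evidence by how certain it is and stops as soon as the fused belief is decisive. The sketch below is that reading under stated assumptions: the entropy-based weighting, stopping threshold, and round cap are illustrative choices, not taken from the paper.

```python
# One possible reading of uncertainty-aware, iterative evidence aggregation.
# The entropy weighting, threshold, and round cap are assumptions, not DIVER's.
import math
from typing import Callable, List

EvidenceSource = Callable[[], float]  # each source returns p(fake) in (0, 1)


def certainty(p_fake: float) -> float:
    """1 - normalized binary entropy: 0.0 at p=0.5, 1.0 at p=0 or p=1."""
    if p_fake in (0.0, 1.0):
        return 1.0
    h = -(p_fake * math.log2(p_fake) + (1 - p_fake) * math.log2(1 - p_fake))
    return 1.0 - h


def fuse(probs: List[float]) -> float:
    """Certainty-weighted average of the fake-probabilities seen so far."""
    weights = [certainty(p) for p in probs]
    total = sum(weights) or 1e-9  # avoid division by zero when all p = 0.5
    return sum(w * p for w, p in zip(weights, probs)) / total


def iterative_refine(sources: List[EvidenceSource],
                     stop_certainty: float = 0.9,
                     max_rounds: int = 3) -> float:
    """Add one evidence source per round; stop early once the belief is decisive."""
    probs: List[float] = []
    fused = 0.5  # uninformative prior
    for source in sources[:max_rounds]:
        probs.append(source())       # e.g. text analysis, alignment, OCR, ...
        fused = fuse(probs)
        if certainty(fused) >= stop_certainty:
            break                    # decisive; skip remaining (costlier) tools
    return fused


if __name__ == "__main__":
    # Hypothetical sources ordered cheap -> expensive (text, alignment, OCR).
    sources = [lambda: 0.62, lambda: 0.70, lambda: 0.93]
    print(f"fused p(fake) = {iterative_refine(sources):.2f}")
```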
Problem

Research questions and friction points this paper is trying to address.

multimodal fake news detection
computational redundancy
hallucination risks
visual foundation
cross-modal semantic discrepancies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Iterative Reasoning
Evidence-Driven Fusion
Cross-Modal Alignment
Uncertainty-Aware Aggregation
Multimodal Fake News Detection