Towards Explainable Bilingual Multimodal Misinformation Detection and Localization

📅 2025-06-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Bilingual multimodal misinformation (e.g., images paired with Chinese–English bilingual captions) is highly stealthy because it combines localized image editing with cross-lingual inconsistency. To address this challenge, the paper proposes BiMi, a framework that jointly models cross-modal and cross-lingual consistency while performing region-level tampering localization and generating interpretable natural-language reasoning. Methodologically, BiMi applies Group Relative Policy Optimization (GRPO) to enhance explanation quality, integrates an online knowledge-retrieval module for improved timeliness, and leverages vision–language pretrained models for end-to-end joint inference. Key contributions include: (1) releasing BiMiBench, the first large-scale benchmark for bilingual multimodal misinformation evaluation; and (2) achieving significant improvements over state-of-the-art methods in classification accuracy (+8.9%), localization accuracy (+15.9%), and explanation quality (+2.5 BERTScore).

📝 Abstract
The increasing realism of multimodal content has made misinformation more subtle and harder to detect, especially in news media where images are frequently paired with bilingual (e.g., Chinese-English) subtitles. Such content often includes localized image edits and cross-lingual inconsistencies that jointly distort meaning while remaining superficially plausible. We introduce BiMi, a bilingual multimodal framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. To support generalization, BiMi integrates an online retrieval module that supplements model reasoning with up-to-date external context. We further release BiMiBench, a large-scale and comprehensive benchmark constructed by systematically editing real news images and subtitles, comprising 104,000 samples with realistic manipulations across visual and linguistic modalities. To enhance interpretability, we apply Group Relative Policy Optimization (GRPO) to improve explanation quality, marking the first use of GRPO in this domain. Extensive experiments demonstrate that BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore, advancing state-of-the-art performance in realistic, multilingual misinformation detection. Code, models, and datasets will be released.
Problem

Research questions and friction points this paper is trying to address.

Detect bilingual multimodal misinformation in news media
Localize image edits and cross-lingual inconsistencies
Improve explainability in misinformation detection models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual multimodal framework for misinformation detection
Online retrieval module for external context integration
Group Relative Policy Optimization for explanation quality
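The GRPO idea cited above can be illustrated with a minimal sketch. GRPO dispenses with a learned value network: for each input, the policy samples a group of candidate explanations, and each candidate's reward is normalized against the group's mean and standard deviation to form its advantage. The reward values and function name below are hypothetical; the paper's actual reward design for explanation quality is not specified here.

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantages, the core of GRPO:
    normalize each sampled candidate's reward against the
    mean and std of its own sampling group (no critic model)."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in group_rewards]

# Hypothetical rewards for 4 explanations sampled for one image-caption pair
advantages = grpo_advantages([0.2, 0.5, 0.8, 0.5])
```

Candidates scoring above the group mean receive positive advantages and are reinforced; the advantages of a group always sum to zero, so the update is relative rather than absolute.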