🤖 AI Summary
Real-world scenarios often involve coexisting multi-source manipulations, yet existing fake news detection methods predominantly assume single-source, single-modality falsifications and lack benchmarks for mixed-source multimodal misinformation.
Method: We introduce MMFakeBench—the first mixed-source multimodal fake news detection benchmark—covering three distortion categories (textual veracity, visual veracity, and cross-modal consistency) and 12 fine-grained subtypes, enabling zero-shot evaluation of large vision-language models (LVLMs) and dedicated detectors. We formally define mixed-source multimodal misinformation and propose MMD-Agent, an LVLM-based agent framework featuring multi-step reasoning and tool-augmented inference for fine-grained distortion modeling and generalized detection.
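The staged detection idea can be pictured as a pipeline that screens a text-image pair against each distortion source in turn. The sketch below is a minimal illustration, not the paper's implementation: the `Sample` fields, checker names, and the toy rules inside each checker are all hypothetical stand-ins for what would be LVLM reasoning steps (possibly tool-augmented, e.g. a retrieval or image-forensics call).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Sample:
    text: str
    image_caption: str  # stand-in for actual image content

# A checker pairs a distortion label with a predicate over the sample.
# In the real system each predicate would be an LVLM (+tool) judgment.
Checker = Tuple[str, Callable[[Sample], bool]]

def detect(sample: Sample, checkers: List[Checker]) -> str:
    """Run distortion checks in sequence; return the first detected
    distortion label, or 'real' if none fires."""
    for label, check in checkers:
        if check(sample):
            return label
    return "real"

# Toy stubs mirroring the benchmark's three distortion sources.
checkers: List[Checker] = [
    ("textual_veracity", lambda s: "miracle cure" in s.text.lower()),
    ("visual_veracity", lambda s: "[synthetic]" in s.image_caption),
    ("cross_modal_inconsistency",
     lambda s: s.text.split()[0].lower() not in s.image_caption.lower()),
]
```

The ordering matters: a sample is attributed to the first distortion source that fires, which keeps the mixed-source label space disjoint even when several checks could trigger.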
Results: Evaluating 15 LVLMs and 6 detection methods on MMFakeBench reveals substantial performance degradation under mixed-source conditions. MMD-Agent achieves an average accuracy gain of 12.7% and significantly improves cross-distortion generalization.
📝 Abstract
Current multimodal misinformation detection (MMD) methods often assume a single source and type of forgery for each sample, which is insufficient for real-world scenarios where multiple forgery sources coexist. The lack of a benchmark for mixed-source misinformation has hindered progress in this field. To address this, we introduce MMFakeBench, the first comprehensive benchmark for mixed-source MMD. MMFakeBench includes 3 critical sources: textual veracity distortion, visual veracity distortion, and cross-modal consistency distortion, along with 12 sub-categories of misinformation forgery types. We further conduct an extensive evaluation of 6 prevalent detection methods and 15 Large Vision-Language Models (LVLMs) on MMFakeBench under a zero-shot setting. The results indicate that current methods struggle under this challenging and realistic mixed-source MMD setting. Additionally, we propose MMD-Agent, a novel approach to integrate the reasoning, action, and tool-use capabilities of LVLM agents, significantly enhancing accuracy and generalization. We believe this study will catalyze future research into more realistic mixed-source multimodal misinformation and provide a fair evaluation of misinformation detection methods.