Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection

📅 2026-03-02

🤖 AI Summary

This work addresses a limitation of existing multimodal media forgery detectors: they rely on outcome-oriented supervision, so they struggle to generalize to unseen manipulation types and offer little interpretability. The authors propose REFORM, a reasoning-driven forensic learning framework that shifts the detection objective from outcome fitting to explicit process modeling. REFORM uses a three-stage curriculum: it first induces forensic rationales, then aligns the reasoning with the final judgment, and finally refines logical consistency via reinforcement learning. To support this approach, the authors construct ROM, a large-scale multimodal forgery dataset with fine-grained reasoning annotations. In the reported experiments, REFORM achieves state-of-the-art performance with improved generalization: 81.52% accuracy on ROM, 76.65% accuracy on DGM4, and 74.9 F1 on MMFakeBench.

📝 Abstract

Recent advances in generative AI have significantly enhanced the realism of multimodal media manipulation, thereby posing substantial challenges to manipulation detection. Existing manipulation detection and grounding approaches predominantly focus on manipulation type classification under result-oriented supervision, which not only lacks interpretability but also tends to overfit superficial artifacts. In this paper, we argue that generalizable detection requires incorporating explicit forensic reasoning, rather than merely classifying a limited set of manipulation types, which fails to generalize to unseen manipulation patterns. To this end, we propose REFORM, a reasoning-driven framework that shifts learning from outcome fitting to process modeling. REFORM adopts a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency via reinforcement learning. To support this paradigm, we introduce ROM, a large-scale dataset with rich reasoning annotations. Extensive experiments show that REFORM establishes new state-of-the-art performance with superior generalization, achieving 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.
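
The abstract does not specify how the reinforcement learning stage scores logical consistency. As a rough illustration only, the sketch below shows one way such a reward could combine correctness of the final judgment with agreement between the rationale and that judgment. The stage names, the `Output` fields, the `consistency_reward` function, and both weights are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical labels for the three curriculum stages described in the abstract.
STAGES = ("rationale_induction", "reasoning_alignment", "rl_refinement")


@dataclass
class Output:
    """One model response, reduced to two verdicts (an assumed simplification)."""
    rationale_verdict: str  # verdict implied by the generated rationale, e.g. "fake"
    final_verdict: str      # verdict stated in the final answer


def consistency_reward(out: Output, gold: str,
                       w_correct: float = 1.0,
                       w_consistent: float = 0.5) -> float:
    """Toy reward: credit for a correct final verdict plus credit for a
    rationale that agrees with the final verdict. Weights are assumptions."""
    correct = 1.0 if out.final_verdict == gold else 0.0
    consistent = 1.0 if out.rationale_verdict == out.final_verdict else 0.0
    return w_correct * correct + w_consistent * consistent


if __name__ == "__main__":
    # A correct answer backed by an agreeing rationale scores highest.
    print(consistency_reward(Output("fake", "fake"), gold="fake"))
    # A correct answer whose rationale argues the opposite scores lower.
    print(consistency_reward(Output("real", "fake"), gold="fake"))
```

Under this toy scoring, a response that is right for reasons matching its answer is preferred over one that is right despite a contradictory rationale, which is the intuition behind rewarding the reasoning process and not just the outcome.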

Problem

Research questions and friction points this paper is trying to address.

multimodal manipulation detection

generalization

forensic reasoning

media forensics

manipulation detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

forensic reasoning

process modeling

multimodal manipulation detection