REVEAL: Reference-Grounded Reasoning for Multimodal Manipulation Detection

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing multimodal forgery detection methods, which often rely on memorizing isolated artifacts and therefore struggle with imperceptible manipulations and domain shifts. The authors reformulate the task as a reference-based verification problem: authenticity classification and tampering localization are performed jointly by comparing a query sample against evidence retrieved from a large-scale reference corpus of 170,000 authentic image-text pairs. To improve generalization, they introduce a training-free domain adaptation mechanism that adapts to new domains solely by updating the reference corpus. In addition, a task-decoupled mixture-of-experts architecture and a difference-aware fusion module are designed to mitigate optimization conflicts between detection and localization. Experiments demonstrate that the proposed method significantly outperforms state-of-the-art approaches across multiple benchmarks, exhibiting strong efficiency, robustness, and practicality.

📝 Abstract

Multimodal manipulation detection aims to simultaneously identify forged image-text pairs and localize tampered regions, yet existing methods typically rely on memorizing isolated artifacts and struggle with imperceptible manipulation traces or domain shifts. Inspired by human comparative reasoning, we reformulate this task as a reference-grounded verification problem, where authenticity is assessed by comparing a query against retrieved authentic evidence. We propose REVEAL (Reference-Enabled Verification for Evidence Analysis and Localization), a framework explicitly designed for this comparative paradigm. To support this paradigm, we construct a large-scale reference library comprising 170K authentic news image-text pairs featuring over 40K public figures. Technically, REVEAL employs a difference-aware fusion mechanism to capture fine-grained discrepancies between the query and retrieved evidence. Furthermore, we introduce a task-decoupled Mixture-of-Experts (MoE) architecture to jointly execute instance-level detection and fine-grained grounding, effectively mitigating optimization conflicts between these heterogeneous objectives. Extensive experiments demonstrate that REVEAL significantly outperforms state-of-the-art methods, and notably enables training-free domain adaptation by simply updating the reference library, offering a robust and practical solution for detecting evolving misinformation. Code is available at https://anonymous.4open.science/r/REVEAL-Reference-A006.
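
The comparative paradigm described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy three-dimensional embeddings, the helper names (`cosine`, `retrieve`, `difference_features`), and the choice to fuse by concatenating the query, the mean evidence vector, and their absolute difference are all assumptions made for illustration only. The sketch shows how a query could be compared against retrieved authentic evidence, and how adding entries to the reference library changes what is retrieved without any retraining.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, library, k=2):
    """Return the k library entries whose embeddings are most similar to the query."""
    ranked = sorted(library, key=lambda item: cosine(query, item["embedding"]), reverse=True)
    return ranked[:k]

def difference_features(query, evidence):
    """Hypothetical fusion: concatenate query, mean evidence, and |query - mean evidence|."""
    n = len(evidence)
    mean = [sum(e["embedding"][i] for e in evidence) / n for i in range(len(query))]
    diff = [abs(q - m) for q, m in zip(query, mean)]
    return query + mean + diff

# Toy reference library of "authentic" samples (made-up embeddings).
library = [
    {"id": "news-a", "embedding": [1.0, 0.0, 0.0]},
    {"id": "news-b", "embedding": [0.9, 0.1, 0.0]},
    {"id": "news-c", "embedding": [0.0, 0.0, 1.0]},
]
query = [1.0, 0.0, 0.2]

evidence = retrieve(query, library, k=2)
features = difference_features(query, evidence)

# Training-free adaptation in this sketch: extend the library, keep all functions unchanged.
adapted_library = library + [{"id": "new-domain", "embedding": [1.0, 0.0, 0.2]}]
adapted_top = retrieve(query, adapted_library, k=1)
```

In a real system the feature vector would feed downstream detection and localization heads; here it only makes concrete the idea that the discrepancy between query and evidence is an explicit input rather than something the model must memorize.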

Problem

Research questions and friction points this paper is trying to address.

multimodal manipulation detection

image-text forgery

tampered region localization

domain shift

imperceptible manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

reference-grounded reasoning

difference-aware fusion

task-decoupled Mixture-of-Experts