Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations

📅 2025-09-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing multimodal tampering detection benchmarks artificially disrupt cross-modal semantic alignment, diverging significantly from real-world attacks that maintain visual–textual semantic consistency through coordinated forgeries. This work introduces the novel task of *semantically coordinated multimodal tampering detection*, proposes SAMM—the first semantically aligned, high-fidelity multimodal tampering dataset—and designs RamDG, a retrieval-augmented tampering detection and localization framework that performs cross-modal contextual evidence reasoning via external knowledge retrieval. SAMM is constructed through a two-stage controllable generation pipeline, while RamDG integrates image grounding, deepfake identification, and retrieval-augmentation modules. On SAMM, RamDG achieves a 2.06% absolute improvement in detection accuracy over state-of-the-art methods, demonstrating substantially enhanced capability in identifying and localizing subtle, semantically consistent multimodal forgeries.

📝 Abstract
The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt cross-modal alignment, creating easily detectable anomalies. To bridge this gap, we pioneer the detection of semantically-coordinated manipulations where visual edits are systematically paired with semantically consistent textual descriptions. Our approach begins with constructing the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated through a two-stage pipeline: 1) applying state-of-the-art image manipulations, followed by 2) generation of contextually-plausible textual narratives that reinforce the visual deception. Building on this foundation, we propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. RamDG first harnesses external knowledge repositories to retrieve contextual evidence, which serves as auxiliary text and is encoded together with the inputs by our image forgery grounding and deep manipulation detection modules to trace all manipulations. Extensive experiments demonstrate our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches. The dataset and code are publicly available at https://github.com/shen8424/SAMM-RamDG-CAP.
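The retrieval-augmented flow the abstract describes — retrieve external evidence for the input caption, fuse it with the inputs, then run detection — can be sketched as a toy pipeline. All function names, the stand-in string-based "encoder", and the tiny knowledge base below are illustrative assumptions, not the authors' implementation, which uses learned image-grounding and manipulation-detection modules.

```python
# Hedged sketch of a RamDG-style retrieval-augmented detection flow.
# The knowledge base, retrieval rule, and detection head are toy
# stand-ins (assumptions), not the paper's actual modules.

KNOWLEDGE_BASE = {
    "eiffel tower": "The Eiffel Tower is in Paris, France.",
    "statue of liberty": "The Statue of Liberty stands in New York Harbor.",
}

def retrieve_evidence(caption: str) -> list[str]:
    """Return knowledge-base entries whose key appears in the caption."""
    lowered = caption.lower()
    return [fact for key, fact in KNOWLEDGE_BASE.items() if key in lowered]

def encode(caption: str, evidence: list[str]) -> str:
    """Concatenate the caption with retrieved auxiliary texts
    (a stand-in for a real multimodal encoder over image + text)."""
    return " [SEP] ".join([caption, *evidence])

def detect_manipulation(fused: str) -> bool:
    """Toy detection head: flag a caption that places the Eiffel Tower
    outside Paris, contradicting the retrieved evidence."""
    lowered = fused.lower()
    return "new york" in lowered and "paris" in lowered

caption = "The Eiffel Tower photographed in New York last week"
fused = encode(caption, retrieve_evidence(caption))
print(detect_manipulation(fused))  # contradiction with retrieved fact
```

The point of the sketch is the structure, not the heuristics: retrieval supplies context the input alone lacks, so a semantically self-consistent forgery can still be exposed by disagreement with external evidence.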
Problem

Research questions and friction points this paper is trying to address.

Detecting semantically-coordinated multimodal manipulation attacks
Grounding manipulated content in aligned visual-textual data
Addressing artificial misalignment in existing manipulation datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-Aligned Multimodal Manipulation dataset construction
Retrieval-Augmented framework using external knowledge repositories
Image forgery grounding with deep manipulation detection modules