ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 7
✨ Influential: 1
🤖 AI Summary
Existing DGM4 methods largely neglect fine-grained cross-modal semantic alignment between images and text, limiting the accuracy of deepfake detection and localization. To address this, we propose a novel semantic alignment-driven tampering-aware paradigm, introducing the first manipulation-guided cross-attention (MGCA) mechanism. Integrated within MLLM/LLM frameworks, MGCA constructs tampering-enhanced image-text pairs, enabling joint optimization of explicit alignment supervision and implicit manipulation perception. Crucially, our method requires no additional annotations, yet effectively models both local cross-modal semantic consistency and spatial correlation with tampered regions. On the DGM4 benchmark, it achieves significant improvements over state-of-the-art methods in both detection and localization performance. These results underscore the critical role of cross-modal semantic alignment in deepfake understanding and establish an interpretable, generalizable technical pathway for multimodal media authenticity analysis.

๐Ÿ“ Abstract
We present ASAP, a new framework for detecting and grounding multi-modal media manipulation (DGM4). Upon thorough examination, we observe that accurate fine-grained cross-modal semantic alignment between image and text is vital for accurate manipulation detection and grounding. However, existing DGM4 methods pay little attention to cross-modal alignment, which hampers further improvement in manipulation detection accuracy. To remedy this issue, this work aims to advance semantic alignment learning to promote this task. In particular, we utilize off-the-shelf Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) to construct aligned image-text pairs, especially for manipulated instances. Subsequently, cross-modal alignment learning is performed to enhance the semantic alignment. Beyond these explicit auxiliary clues, we further design a Manipulation-Guided Cross Attention (MGCA) mechanism to provide implicit guidance for augmenting manipulation perception. With the ground truth available during training, MGCA encourages the model to concentrate on manipulated components while downplaying normal ones, enhancing its ability to capture manipulations. Extensive experiments on the DGM4 dataset demonstrate that our model surpasses comparison methods by a clear margin.
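The paper's implementation details are not reproduced on this page, but the abstract's description of MGCA suggests a cross-attention whose logits are biased toward ground-truth manipulated tokens during training. The following is a minimal single-query sketch of that idea; the additive `boost` bias and the binary `manip_mask` shape are illustrative assumptions, not the paper's exact formulation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def manipulation_guided_attention(q, keys, values, manip_mask, boost=1.0):
    """One query attending over key/value tokens.

    manip_mask[i] = 1 if token i lies in a manipulated region (ground truth,
    available only at training time), else 0. Logits for manipulated tokens
    receive a positive bias, so attention concentrates on them while
    downplaying normal tokens.
    """
    d = len(q)
    logits = [
        sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) + boost * m
        for k, m in zip(keys, manip_mask)
    ]
    weights = softmax(logits)
    # Weighted sum of value vectors.
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return out, weights
```

With identical keys, a token flagged as manipulated receives a strictly larger attention weight than a normal one; setting `boost=0` recovers plain scaled dot-product attention, which matches the intuition that the guidance is only an implicit training-time signal.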
Problem

Research questions and friction points this paper is trying to address.

Advancing cross-modal semantic alignment for manipulation detection
Enhancing manipulation grounding through multimodal alignment learning
Improving detection accuracy with manipulation-guided attention mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes MLLMs and LLMs to construct manipulated image-text pairs
Implements cross-modal alignment learning for enhanced semantic alignment
Designs Manipulation-Guided Cross Attention to focus on manipulated components
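The cross-modal alignment learning mentioned above is not specified in detail on this page; a common choice for aligning paired image and text embeddings is a symmetric InfoNCE-style contrastive loss, sketched below as a plausible stand-in (the cosine-similarity scoring and `temperature` value are assumptions, not the paper's exact objective):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_loss(img_embs, txt_embs, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired embeddings.

    The i-th image should match the i-th text (and vice versa); every other
    pairing in the batch serves as a negative.
    """
    n = len(img_embs)
    total = 0.0
    for i in range(n):
        # Image -> text direction: cross-entropy against the matching text.
        logits = [cosine(img_embs[i], t) / temperature for t in txt_embs]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_z)
        # Text -> image direction.
        logits = [cosine(txt_embs[i], v) / temperature for v in img_embs]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_z)
    return total / (2 * n)
```

Under this objective, a batch whose image and text embeddings are correctly paired yields a lower loss than the same batch with the texts shuffled, which is exactly the pressure that pulls matched image-text pairs together in the shared embedding space.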