Bridging Semantic Logic Gaps: A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization

📅 2025-08-10

🤖 AI Summary
Existing image manipulation localization (IML) methods over-rely on low-level visual cues while neglecting logical consistency in high-level semantic content. To address this, the authors propose a cognition-inspired multimodal boundary-preserving network (CMB-Net), presented as the first to incorporate a textual modality into IML. The approach uses large language models to generate semantic prompts and introduces an image-text central ambiguity module (ITCAM) to suppress hallucination-induced interference. It further adds a correlation-matrix-driven image-text interaction module (ITIM) and an invertible-network-inspired restoration edge decoder (RED), which jointly model semantic consistency and preserve the boundaries of tampered regions. In extensive experiments, CMB-Net outperforms most existing IML models.


📝 Abstract
Existing image manipulation localization (IML) models mainly rely on visual cues but ignore the semantic logical relationships between content features. In fact, the content semantics conveyed by real images often conform to human cognitive laws. Image manipulation, however, usually destroys the internal relationships between content features, thus leaving semantic clues for IML. In this paper, we propose a cognition-inspired multimodal boundary-preserving network (CMB-Net). Specifically, CMB-Net utilizes large language models (LLMs) to analyze manipulated regions within images and generate prompt-based textual information to compensate for the lack of semantic relationships in the visual information. Because erroneous text induced by LLM hallucination can damage the accuracy of IML, we propose an image-text central ambiguity module (ITCAM). It assigns weights to the text features by quantifying the ambiguity between text and image features, thereby ensuring that the textual information has a beneficial impact. We also propose an image-text interaction module (ITIM) that aligns visual and text features using a correlation matrix for fine-grained interaction. Finally, inspired by invertible neural networks, we propose a restoration edge decoder (RED) that mutually generates input and output features to preserve boundary information in manipulated regions without loss. Extensive experiments show that CMB-Net outperforms most existing IML models.
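The abstract gives no formulas for ITCAM or ITIM, so the following is only a rough sketch of the two ideas, not the authors' implementation. It assumes cosine similarity to the mean image feature as the ambiguity measure and a row-wise softmax over dot products as the correlation matrix. The function names, the similarity choice, the residual fusion, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def text_weights(image_center, text_feats):
    """Assumed ITCAM-like weighting: map each text feature's cosine similarity
    to the image centroid from [-1, 1] onto a weight in [0, 1]."""
    return [(cosine(image_center, t) + 1.0) / 2.0 for t in text_feats]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def correlation_matrix(image_feats, text_feats):
    """Assumed ITIM-like correlation: rows are image positions, columns are
    text tokens, and each row is softmax-normalized dot products."""
    return [
        softmax([sum(a * b for a, b in zip(img, txt)) for txt in text_feats])
        for img in image_feats
    ]

def fuse(image_feats, text_feats):
    """Down-weight ambiguous text, then add correlation-weighted text context
    to every image feature (a residual-style fusion, chosen for illustration)."""
    center = [sum(col) / len(image_feats) for col in zip(*image_feats)]
    weights = text_weights(center, text_feats)
    weighted = [[w * x for x in t] for w, t in zip(weights, text_feats)]
    corr = correlation_matrix(image_feats, weighted)
    fused = []
    for img, row in zip(image_feats, corr):
        context = [sum(r * t[d] for r, t in zip(row, weighted)) for d in range(len(img))]
        fused.append([a + c for a, c in zip(img, context)])
    return weights, corr, fused

# Toy data: two image positions, one consistent text feature and one that
# contradicts the image content (standing in for a hallucinated description).
image_feats = [[1.0, 0.0], [0.8, 0.2]]
text_feats = [[1.0, 0.1], [-1.0, 0.0]]
weights, corr, fused = fuse(image_feats, text_feats)
print(weights)
```

In this toy setting the contradictory text feature receives a weight near zero, so it contributes almost nothing to the fused features. That is the effect the abstract attributes to ITCAM, though the real module may quantify ambiguity quite differently.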
Problem

Research questions and friction points this paper is trying to address.

Identifies manipulated image regions using semantic logic gaps
Compensates visual data with LLM-generated textual cues
Preserves boundary accuracy in tampered areas
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs to analyze manipulated image regions
Introduces ITCAM to weight text features accurately
Employs RED to preserve boundary information losslessly
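The summary does not describe RED's architecture. To illustrate why an invertible design can carry information "without loss", here is a minimal sketch of an additive coupling layer, a standard building block of invertible neural networks. It is not the paper's decoder. The `shift` function is a hypothetical stand-in for a learned sub-network.

```python
def shift(xs):
    """Hypothetical stand-in for a learned sub-network. It does not need to be
    invertible itself; the coupling structure provides the inverse."""
    return [0.5 * x * x + 1.0 for x in xs]

def forward(x1, x2):
    """Additive coupling: pass x1 through unchanged and shift x2 by a
    function of x1."""
    y1 = list(x1)
    y2 = [b + s for b, s in zip(x2, shift(x1))]
    return y1, y2

def inverse(y1, y2):
    """Recover the inputs: y1 equals x1, so the same shift can be recomputed
    and subtracted (exact up to floating-point rounding)."""
    x1 = list(y1)
    x2 = [b - s for b, s in zip(y2, shift(y1))]
    return x1, x2

# Round trip on toy values: the inverse reproduces the original inputs.
x1, x2 = [1.0, 2.0], [0.25, -0.5]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(r1, r2)
```

Because the forward pass can be undone from its outputs alone, no input information is discarded. That is the general property an invertible-network-inspired decoder can exploit to keep boundary detail.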
Authors

Songlin Li
School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
Zhiqing Guo
School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
Yuanman Li
Associate Professor, Shenzhen University
Multimedia security, Machine learning, https://yuanmanli.github.io/
Zeyu Li
School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
Yunfeng Diao
Assistant Professor, Hefei University of Technology
Adversarial Robustness, Computer Vision, AI Security
Gaobo Yang
College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Liejun Wang
School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China