Bridging Semantic Logic Gaps: A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization

📅 2025-08-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing image manipulation localization (IML) methods rely heavily on low-level visual cues and neglect the logical consistency of high-level semantic content. To address this, the authors propose a cognition-inspired multimodal boundary-preserving network (CMB-Net) that adds a textual modality to IML. The approach uses large language models to generate semantic prompts and introduces an image-text central ambiguity module (ITCAM) that down-weights text affected by LLM hallucination. It also includes an image-text interaction module (ITIM), which aligns visual and text features through a correlation matrix, and a restoration edge decoder (RED), inspired by invertible neural networks, which preserves the boundaries of manipulated regions. In the reported experiments, CMB-Net outperforms most existing IML models.

📝 Abstract

Existing image manipulation localization (IML) models rely mainly on visual cues but ignore the semantic logical relationships between content features. In fact, the content semantics conveyed by real images often conform to human cognitive laws. Image manipulation, however, usually destroys the internal relationships between content features, and thus leaves semantic clues for IML. In this paper, we propose a cognition-inspired multimodal boundary-preserving network (CMB-Net). Specifically, CMB-Net uses large language models (LLMs) to analyze manipulated regions within images and generate prompt-based textual information that compensates for the lack of semantic relationships in the visual information. Because erroneous text produced by LLM hallucination can damage the accuracy of IML, we propose an image-text central ambiguity module (ITCAM). It assigns weights to the text features by quantifying the ambiguity between text and image features, thereby ensuring that the textual information has a beneficial effect. We also propose an image-text interaction module (ITIM) that aligns visual and text features using a correlation matrix for fine-grained interaction. Finally, inspired by invertible neural networks, we propose a restoration edge decoder (RED) that mutually generates input and output features to preserve boundary information in manipulated regions without loss. Extensive experiments show that CMB-Net outperforms most existing IML models.
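The abstract does not give formulas for ITCAM or ITIM, so the following is only a minimal sketch of the two ideas under stated assumptions. It assumes ambiguity is measured as cosine similarity between each text feature and the centroid of the image features, and that the correlation matrix is a dot product between image and text features, normalized with a softmax. The function names `ambiguity_weights` and `interact`, the feature values, and the residual fusion are all hypothetical and not taken from the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def centroid(rows):
    # Mean feature vector across all rows.
    return [sum(col) / len(rows) for col in zip(*rows)]

def ambiguity_weights(text_feats, image_feats):
    # Assumption: text features that agree with the image centroid are
    # trusted more. Cosine in [-1, 1] is mapped linearly to [0, 1].
    c = centroid(image_feats)
    return [0.5 * (1.0 + cosine(t, c)) for t in text_feats]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def interact(image_feats, text_feats, weights):
    # Scale each text feature by its weight, then build one row of the
    # image-text correlation matrix per image feature.
    weighted = [[w * x for x in t] for t, w in zip(text_feats, weights)]
    fused = []
    for v in image_feats:
        corr_row = [dot(v, t) for t in weighted]
        attn = softmax(corr_row)
        # Text context for this image feature, added as a residual.
        ctx = [sum(a * t[d] for a, t in zip(attn, weighted))
               for d in range(len(v))]
        fused.append([x + y for x, y in zip(v, ctx)])
    return fused

# Toy features: the second text vector contradicts the image content,
# standing in for a hallucinated description.
image_feats = [[1.0, 0.0], [0.8, 0.2]]
text_feats = [[1.0, 0.1], [-1.0, 0.0]]

weights = ambiguity_weights(text_feats, image_feats)
fused = interact(image_feats, text_feats, weights)
print(weights)
print(fused)
```

In this toy setup the contradictory text vector receives a weight close to zero, so it contributes little to the fused features. The actual modules presumably operate on learned, high-dimensional features.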

Problem

Research questions and friction points this paper is trying to address.

Identifies manipulated image regions using semantic logic gaps

Compensates visual data with LLM-generated textual cues

Preserves boundary accuracy in tampered areas

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs to analyze manipulated image regions

Introduces ITCAM to weight text features accurately

Employs RED to preserve boundary information losslessly
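The summary says only that RED is inspired by invertible neural networks, without describing its layers. As an illustration of why invertible designs can carry information without loss, here is a minimal sketch of an additive coupling layer, a standard invertible building block. It is not claimed to be RED's architecture, and the `shift` function is a hypothetical stand-in for a learned sub-network.

```python
def shift(xs):
    # Hypothetical stand-in for a learned sub-network; it can be any
    # function, and it never needs to be inverted itself.
    return [2.0 * x + 1.0 for x in xs]

def forward(x1, x2):
    # The first half passes through unchanged; the second half is
    # shifted by a function of the first half.
    y1 = list(x1)
    y2 = [b + s for b, s in zip(x2, shift(x1))]
    return y1, y2

def inverse(y1, y2):
    # Because y1 equals x1, the same shift can be recomputed and
    # subtracted, recovering x2 exactly.
    x1 = list(y1)
    x2 = [b - s for b, s in zip(y2, shift(y1))]
    return x1, x2

x1, x2 = [0.5, -1.0], [2.0, 3.0]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(r1 == x1 and r2 == x2)
```

The inverse recomputes the same shift from the untouched half, so the input is recovered from the output alone. This is the property that makes invertible designs attractive when fine boundary detail must survive decoding.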

🔎 Similar Papers

Dense Feature Interaction Network for Image Inpainting Localization