AI Summary
Existing image manipulation localization (IML) methods over-rely on low-level visual cues while neglecting logical consistency in high-level semantic content. To address this, we propose a cognition-inspired multimodal boundary-preserving network, the first to incorporate the textual modality into IML. Our approach leverages large language models to generate semantic prompts and introduces an image-text central ambiguity module to suppress hallucination-induced interference. We further design a correlation matrix-driven cross-modal interaction mechanism and an invertible-network-inspired edge recovery decoder to jointly model semantic consistency and reconstruct tampered boundaries with high fidelity. Evaluated on multiple benchmark datasets, our method achieves significant improvements in both localization accuracy and boundary sharpness. This work establishes a novel paradigm for IML that integrates cognitive priors with synergistic multimodal reasoning.
Abstract
Existing image manipulation localization (IML) models rely mainly on visual cues and ignore the semantic logical relationships between content features. In fact, the content semantics conveyed by real images usually conform to human cognitive laws, whereas image manipulation tends to destroy the internal relationships between content features, leaving semantic clues for IML. In this paper, we propose a cognition-inspired multimodal boundary-preserving network (CMB-Net). Specifically, CMB-Net uses large language models (LLMs) to analyze manipulated regions within images and generate prompt-based textual information that compensates for the lack of semantic relationships in the visual information. Because erroneous text induced by LLM hallucination can damage the accuracy of IML, we propose an image-text central ambiguity module (ITCAM). It assigns weights to the text features by quantifying the ambiguity between text and image features, thereby ensuring that the textual information has a beneficial impact. We also propose an image-text interaction module (ITIM) that aligns visual and text features using a correlation matrix for fine-grained interaction. Finally, inspired by invertible neural networks, we propose a restoration edge decoder (RED) that mutually generates input and output features to preserve boundary information in manipulated regions without loss. Extensive experiments show that CMB-Net outperforms most existing IML models.
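The abstract does not describe how RED is built, so the following is only a minimal sketch of the general property it borrows from invertible neural networks: a layer whose input can be recovered exactly from its output. It uses additive coupling, a standard invertible building block, and is not the authors' decoder. The names `transform`, `coupling_forward`, and `coupling_inverse` are hypothetical, and `transform` is a fixed stand-in for what would normally be a learned sub-network.

```python
import math


def transform(half):
    """Hypothetical stand-in for a learned sub-network (assumption).

    It does not need to be invertible itself; any deterministic function works.
    """
    return [math.tanh(2.0 * v) + 0.5 for v in half]


def _split(features):
    """Split a feature vector into two equal halves."""
    if len(features) % 2 != 0:
        raise ValueError("feature vector must have an even length")
    mid = len(features) // 2
    return features[:mid], features[mid:]


def coupling_forward(x):
    """Additive coupling: y1 = x1, y2 = x2 + transform(x1)."""
    x1, x2 = _split(x)
    y2 = [b + t for b, t in zip(x2, transform(x1))]
    return x1 + y2


def coupling_inverse(y):
    """Exact inverse: x1 = y1, x2 = y2 - transform(y1)."""
    y1, y2 = _split(y)
    x2 = [b - t for b, t in zip(y2, transform(y1))]
    return y1 + x2


if __name__ == "__main__":
    features = [0.2, -1.0, 0.7, 3.5]
    encoded = coupling_forward(features)
    recovered = coupling_inverse(encoded)
    print("encoded:  ", encoded)
    print("recovered:", recovered)
```

Because the first half passes through unchanged, the inverse can recompute the same shift and subtract it, so no information is discarded along the way. This is the sense in which an invertible design can, in principle, keep fine detail such as region boundaries that an ordinary lossy decoder might smooth away.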