SAPL: Semantic-Agnostic Prompt Learning in CLIP for Weakly Supervised Image Manipulation Localization

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing weakly supervised image manipulation localization methods rely on image-level labels and struggle to capture critical boundary cues, limiting their localization accuracy. This work proposes a Semantic-Agnostic Prompt Learning (SAPL) framework that, for the first time, introduces a boundary-aware mechanism into CLIP. By leveraging Edge-aware Contextual Prompt Learning (ECPL) and Hierarchical Edge Contrastive Learning (HECL), SAPL steers the model to focus on the local edge characteristics of manipulated regions rather than high-level semantics. The approach jointly models edge information in both the textual and visual spaces, effectively uncovering fine-grained tampering clues under weak supervision. Experiments demonstrate that SAPL significantly outperforms current state-of-the-art methods across multiple public benchmarks, achieving the best reported performance in weakly supervised image manipulation localization.
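The localization mechanism described above rests on scoring each visual patch against a learned text prompt embedding, so that the resulting similarity map highlights manipulation edges instead of object semantics. As a rough, hedged illustration (not the paper's implementation; the function names and the plain-list feature representation are invented here), a CLIP-style cosine-similarity map over a patch grid can be sketched as:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_map(patch_feats, prompt_feat, grid_w):
    """Score every visual patch feature against the learned prompt
    embedding and reshape the scores into a 2-D localization map.
    In SAPL this map is post-processed to predict manipulated regions."""
    sims = [cosine(f, prompt_feat) for f in patch_feats]
    return [sims[i:i + grid_w] for i in range(0, len(sims), grid_w)]
```

For example, with a prompt embedding aligned to the first feature axis, patches along that axis score high (near 1) while orthogonal patches score near 0, yielding a map that can be thresholded into a mask.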

📝 Abstract
Malicious image manipulation threatens public safety and requires efficient localization methods. Fully supervised approaches depend on costly pixel-level annotations, making training expensive, while existing weakly supervised methods rely only on image-level binary labels and focus on global classification, often overlooking the local edge cues that are critical for precise localization. We observe that feature variations at manipulated boundaries are substantially larger than in interior regions. To address this gap, we propose Semantic-Agnostic Prompt Learning (SAPL) in CLIP, which learns text prompts that intentionally encode non-semantic, boundary-centric cues so that CLIP's multimodal similarity highlights manipulation edges rather than high-level object semantics. SAPL combines two complementary modules, Edge-aware Contextual Prompt Learning (ECPL) and Hierarchical Edge Contrastive Learning (HECL), to exploit edge information in both the textual and visual spaces. The proposed ECPL leverages edge-enhanced image features to generate learnable textual prompts via an attention mechanism, embedding semantics-irrelevant information into the text features to guide CLIP to focus on manipulation edges. The proposed HECL extracts genuine and manipulated edge patches and applies contrastive learning to sharpen the discrimination between them. Finally, we predict the manipulated regions from the post-processed similarity map. Extensive experiments on multiple public benchmarks demonstrate that SAPL significantly outperforms existing approaches, achieving state-of-the-art localization performance.
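The HECL module described in the abstract contrasts genuine edge patches against manipulated ones. A minimal sketch of such an objective, assuming an InfoNCE-style formulation over L2-normalized patch features (the function names, temperature value, and positive/negative pairing scheme are illustrative assumptions, not the paper's exact loss):

```python
import math

def info_nce(anchor, positive, negatives, temp=0.07):
    """InfoNCE loss for one anchor: pull the positive (same-class edge
    patch) close while pushing negatives (other-class patches) away.
    Features are assumed L2-normalized, so dot product = cosine sim."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    pos = math.exp(dot(anchor, positive) / temp)
    neg = sum(math.exp(dot(anchor, n) / temp) for n in negatives)
    return -math.log(pos / (pos + neg))

def edge_contrastive_loss(genuine, manipulated, temp=0.07):
    """Average InfoNCE over all anchors, treating patches of the same
    class (genuine or manipulated) as positives and patches of the
    other class as negatives."""
    total, count = 0.0, 0
    for cls, other in ((genuine, manipulated), (manipulated, genuine)):
        for i, a in enumerate(cls):
            for j, p in enumerate(cls):
                if i == j:
                    continue  # skip the anchor itself
                total += info_nce(a, p, other, temp)
                count += 1
    return total / max(count, 1)
```

When the two classes are already well separated in feature space, the loss is near zero; overlapping classes produce a large loss, driving the encoder to make genuine and manipulated edge features discriminable.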
Problem

Research questions and friction points this paper is trying to address.

image manipulation localization
weakly supervised learning
edge cues
CLIP
prompt learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-Agnostic Prompt Learning
Weakly Supervised Localization
CLIP
Edge-aware Prompting
Contrastive Learning