Bridging the Micro-Macro Gap: Frequency-Aware Semantic Alignment for Image Manipulation Localization

📅 2026-04-14

🤖 AI Summary
Existing image manipulation localization methods struggle to handle both conventional forgeries and localized, photorealistic edits generated by diffusion models, primarily because fine-grained forensic cues and high-level semantic understanding are modeled in isolation. To address this, the work proposes FASA, a unified framework that jointly models frequency-domain artifacts and semantic consistency. FASA uses an adaptive dual-band DCT module to extract manipulation-sensitive frequency features, applies patch-level contrastive alignment on frozen CLIP representations to learn semantic priors, and introduces a semantic-frequency side adapter for multi-scale feature interaction. A prototype-guided, frequency-gated mask decoder then predicts tampered regions. The method reports state-of-the-art localization performance on OpenSDI and multiple traditional benchmarks, with strong cross-generator and cross-dataset generalization and robustness to common image degradations.
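
To make the frequency branch more concrete, here is a minimal sketch of a two-band DCT split on a single image block. It is not FASA's module: the 8x8 block size, the fixed `cutoff`, and the names `dct2` and `band_energies` are all assumptions made for illustration, whereas the paper describes its band selection as adaptive.

```python
import math

def dct2(block):
    """Orthonormal 2D DCT-II of a square block given as a list of lists."""
    n = len(block)

    def alpha(k):
        # Scaling that makes the transform orthonormal (energy preserving).
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    # basis[k][i] = cos(pi * (2i + 1) * k / (2n))
    basis = [[math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)]
             for k in range(n)]
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += block[x][y] * basis[u][x] * basis[v][y]
            out[u][v] = alpha(u) * alpha(v) * s
    return out

def band_energies(coeffs, cutoff):
    """Sum squared coefficients into a low band (u + v < cutoff) and a high band.

    The fixed cutoff is a hypothetical stand-in for an adaptive band split.
    """
    low = high = 0.0
    for u, row in enumerate(coeffs):
        for v, c in enumerate(row):
            if u + v < cutoff:
                low += c * c
            else:
                high += c * c
    return low, high

# A flat block versus a checkerboard block with strong pixel-level variation.
flat = [[128.0] * 8 for _ in range(8)]
checker = [[255.0 if (x + y) % 2 else 0.0 for y in range(8)] for x in range(8)]

flat_low, flat_high = band_energies(dct2(flat), cutoff=4)
chk_low, chk_high = band_energies(dct2(checker), cutoff=4)
```

The flat block puts all of its energy into the DC coefficient, while the checkerboard leaves a large share in the high band. A learned module would presumably turn such band responses into feature maps rather than scalar energies.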

📝 Abstract
As generative image editing advances, image manipulation localization (IML) must handle both traditional manipulations with conspicuous forensic artifacts and diffusion-generated edits that appear locally realistic. Existing methods typically rely on either low-level forensic cues or high-level semantics alone, leading to a fundamental micro-macro gap. To bridge this gap, we propose FASA, a unified framework for localizing both traditional and diffusion-generated manipulations. Specifically, we extract manipulation-sensitive frequency cues through an adaptive dual-band DCT module and learn manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP representations. We then inject these priors into a hierarchical frequency pathway through a semantic-frequency side adapter for multi-scale feature interaction, and employ a prototype-guided, frequency-gated mask decoder to integrate semantic consistency with boundary-aware localization for tampered region prediction. Extensive experiments on OpenSDI and multiple traditional manipulation benchmarks demonstrate state-of-the-art localization performance, strong cross-generator and cross-dataset generalization, and robust performance under common image degradations.
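
The abstract does not spell out the contrastive objective. One plausible reading of "patch-level contrastive alignment" is a supervised contrastive loss in which patches sharing the same authentic/tampered label are treated as positives. The sketch below assumes that formulation; the function name `patch_contrastive_loss`, the temperature value, and the hand-made 2D embeddings (standing in for frozen CLIP patch features) are all hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def patch_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over patch embeddings.

    labels[i] is 1 for a tampered patch and 0 for an authentic one.
    Patches with the same label are positives; all other patches
    appear in the softmax denominator.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Temperature-scaled cosine similarity to every other patch.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0

# Toy patch embeddings forming two tight clusters.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

# Labels that agree with the clusters versus labels that cut across them.
loss_aligned = patch_contrastive_loss(patches, [0, 0, 1, 1])
loss_misaligned = patch_contrastive_loss(patches, [0, 1, 0, 1])
```

When the labels match the cluster structure the loss is close to zero, and it grows sharply when tampered and authentic patches are mixed within a cluster. Minimizing such a loss would push same-label patches together in the embedding space.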
Problem

Research questions and friction points this paper is trying to address.

image manipulation localization
micro-macro gap
diffusion-generated edits
forensic artifacts
semantic alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

frequency-aware
semantic alignment
image manipulation localization
dual-band DCT
CLIP contrastive learning
🔎 Similar Papers
2024-02-12, International Conference on Information Photonics, Citations: 1
2024-07-26, International Workshop on Information Forensics and Security, Citations: 5
2024-08-05, IEEE Transactions on Information Forensics and Security, Citations: 1