SIGMA: Semantic-Difference Instruction-Grounding Mask Annotator for Text-Driven Image Manipulation Localization

📅 2026-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of low-cost, large-scale pixel-level annotations, which limits current image manipulation localization (IML) methods. The authors propose an automatic annotation framework that removes the need for manual masks by extracting localization supervision from publicly available (original, edited) image pairs produced by text-driven editing. The approach computes semantic feature differences with a vision foundation model and integrates an instruction-guided spatial prior through bidirectional cross-modal refinement. To bridge the domain gap between diffusion-based image editing and IML training, it adds VAE round-trip noise calibration, EMA-based self-training, and an edit-noise disentanglement loss. Evaluated on five benchmarks, the method outperforms existing automatic mask generators (+12.20% F1, +11.16% IoU) and yields a roughly 1.1-million-sample IML training set that boosts the average F1 score of six detectors by 18.34%.
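The summary reports gains in F1 and IoU without defining them. As a minimal sketch, assuming the standard pixel-level definitions for binary masks (the paper's exact evaluation protocol is not given here, and the function name `mask_f1_iou` is hypothetical), the two metrics can be computed as follows:

```python
import numpy as np

def mask_f1_iou(pred, target, eps=1e-8):
    """Pixel-level F1 and IoU between two binary masks of equal shape.

    Assumption: standard definitions; not taken from the paper's code.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.logical_and(pred, target).sum()    # manipulated pixels found
    fp = np.logical_and(pred, ~target).sum()   # clean pixels flagged
    fn = np.logical_and(~pred, target).sum()   # manipulated pixels missed
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    return float(f1), float(iou)

# Toy example: 2 true positives, 1 false positive, 1 false negative.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
f1, iou = mask_f1_iou(pred, target)
```

For the toy masks, F1 is 2/3 and IoU is 1/2. IoU is never larger than F1 on the same prediction, which is worth keeping in mind when comparing the two reported gains.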
📝 Abstract
Text-driven image editing has advanced rapidly, but reliably localizing these manipulations requires image manipulation localization (IML) models trained on large pixel-annotated datasets, and there is still no low-cost way to obtain such training data at scale. We observe that these data already exist in disguise: public editing datasets contain millions of (original, edited) pairs that are structurally identical to IML training samples and lack only pixel-level masks. Recovering these masks automatically is non-trivial: pixel differencing is overwhelmed by diffusion-induced perturbations across all pixels, and instruction-only grounding localizes only what the prompt describes, missing unintended editor side effects. We propose SIGMA (Semantic-difference Instruction-Grounding Mask Annotator), which performs semantic-feature differencing in a vision foundation backbone and injects an instruction-derived spatial prior into this visual stream via bidirectional cross-modal refinement, amplifying the difference signal at intended-edit regions when the editor faithfully realizes user intent. SIGMA is trained in two complementary stages: Stage I supervises on inpainting masks; Stage II closes the diffusion-domain shift via VAE round-trip noise calibration, EMA self-training, and an edit-noise disentanglement loss. SIGMA outperforms existing automatic mask generators on five benchmarks (+12.20% F1, +11.16% IoU). When applied to public editing corpora, it produces an IML training set of ~1.1M samples that improves six diverse detectors by +18.34% F1 across five datasets, turning previously unused editing data into a model-agnostic supervisory resource for IML. We will release the full codebase once the paper is accepted.
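To make the idea of semantic-feature differencing concrete, here is a minimal sketch and not SIGMA's implementation. It assumes that a backbone has already produced a grid of patch features for each image (random arrays stand in for them here), measures the per-patch cosine distance, and applies a hypothetical fixed threshold. The instruction prior, the cross-modal refinement, and both training stages are not modeled. The `ema_update` helper shows the generic exponential-moving-average rule commonly used for teacher weights in self-training; the decay value is an assumption.

```python
import numpy as np

def semantic_difference_map(feat_orig, feat_edit, eps=1e-8):
    """Per-patch cosine distance between two (H, W, C) feature grids."""
    a = feat_orig / (np.linalg.norm(feat_orig, axis=-1, keepdims=True) + eps)
    b = feat_edit / (np.linalg.norm(feat_edit, axis=-1, keepdims=True) + eps)
    return 1.0 - np.sum(a * b, axis=-1)  # 0 = same direction, 2 = opposite

def ema_update(teacher, student, decay=0.999):
    """Generic EMA rule: teacher <- decay * teacher + (1 - decay) * student."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

rng = np.random.default_rng(0)
feat_orig = rng.normal(size=(8, 8, 256))  # stand-in for backbone features

# Small perturbation on every patch (stand-in for diffusion-induced noise).
feat_edit = feat_orig + 0.01 * rng.normal(size=feat_orig.shape)
# A 3x3 region replaced by unrelated features (stand-in for a real edit).
feat_edit[2:5, 2:5] = rng.normal(size=(3, 3, 256))

diff = semantic_difference_map(feat_orig, feat_edit)
mask = diff > 0.5  # hypothetical threshold, chosen only for this toy example

# One EMA step on a toy parameter dictionary.
teacher = ema_update({"w": np.array([1.0])}, {"w": np.array([0.0])}, decay=0.9)
```

In this toy setup the global low-level noise barely changes the feature directions, so only the replaced 3x3 region crosses the threshold. That is the intuition behind differencing in feature space rather than pixel space, where the abstract notes that perturbations touch every pixel.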
Problem

Research questions and friction points this paper is trying to address.

image manipulation localization
pixel-level annotation
text-driven image editing
mask generation
training data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic-difference
instruction-grounding
mask annotation
diffusion-based editing
cross-modal refinement