🤖 AI Summary
Diffusion models often distort non-target regions in image editing because of their global denoising dynamics. This paper proposes the first localized image editing framework based on Masked Generative Transformers (MGT), which avoids global perturbations via token-level mask prediction, enabling precise region editing and strong fidelity preservation. Key contributions include: (1) the first adaptation of MGT to image editing; (2) a multi-layer attention consolidation mechanism that fuses cross-attention maps across layers to sharpen spatial localization; and (3) region-hold sampling and attention-injection fine-tuning, parameter-free strategies that adapt pre-trained MGTs to editing without adding parameters. Evaluated on four benchmarks, our method matches diffusion models in editing quality while running inference 6× faster; it improves style change and style transfer by 3.6% and 17.6%, respectively, with under 1B parameters. We also introduce CrispEdit-2M, a high-resolution image editing dataset.
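The attention consolidation idea above can be illustrated with a minimal NumPy sketch: per-layer cross-attention maps are normalized, averaged, and thresholded into a localization mask. The function name, normalization scheme, and threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def fuse_attention_maps(attn_maps, threshold=0.5):
    """Fuse per-layer cross-attention maps into one localization mask.

    attn_maps: list of (H, W) arrays, one cross-attention map per layer.
    Returns a boolean (H, W) mask of edit-relevant regions.
    Hypothetical sketch; names and choices here are not from the paper.
    """
    normed = []
    for a in attn_maps:
        # Normalize each layer's map to [0, 1] so no single layer dominates.
        a = a - a.min()
        rng = a.max()
        normed.append(a / rng if rng > 0 else a)
    # Average across layers, then threshold relative to the fused peak.
    fused = np.mean(normed, axis=0)
    return fused >= threshold * fused.max()
```

In practice one would take these maps from the text-token cross-attention of the MGT and choose the threshold per task; the averaging step is what makes the mask robust to any single noisy layer.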
📝 Abstract
Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than refining the image holistically, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves comparable similarity performance while enabling 6× faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.
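Region-hold sampling, as described above, restricts token flipping to high-attention areas during the MGT's iterative decoding. A hedged sketch of a single decoding step follows; the function name, the `keep_ratio` heuristic, and the per-token attention scores are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def region_hold_step(src_tokens, pred_tokens, attn, keep_ratio=0.7):
    """One illustrative decoding step with region-hold sampling.

    src_tokens:  (N,) token ids of the source image.
    pred_tokens: (N,) token ids proposed by the MGT at this step.
    attn:        (N,) per-token edit-relevance scores in [0, 1].
    Tokens in the lowest `keep_ratio` fraction of attention are held at
    their source values; only high-attention tokens may flip.
    Hypothetical sketch, not the paper's exact procedure.
    """
    n_hold = int(len(attn) * keep_ratio)
    hold_idx = np.argsort(attn)[:n_hold]   # lowest-attention tokens
    out = pred_tokens.copy()
    out[hold_idx] = src_tokens[hold_idx]   # preserve non-target regions
    return out
```

Repeating such a step across the MGT's decoding iterations would confine edits to the attended region while leaving background tokens identical to the source, which is the fidelity-preservation behavior the abstract claims.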