Masked Generative Transformer Is What You Need for Image Editing

📅 2026-05-11

🤖 AI Summary

This work addresses a weakness of diffusion-based image editing: global denoising couples edited and unedited regions, so changes leak into areas that should stay intact. The paper introduces EditMGT, which it presents as the first image editing framework built on Masked Generative Transformers (MGTs). EditMGT relies on the localized token prediction of MGTs to confine changes to the intended regions. A multi-layer attention consolidation step aggregates cross-attention maps into an edit localization signal, and region-hold sampling prevents tokens outside the target area from flipping. Training uses CrispEdit-2M, a 2M-sample high-resolution editing dataset spanning seven categories. With only 960M parameters, EditMGT reports state-of-the-art image similarity on multiple benchmarks and 6x faster editing.

📝 Abstract

Diffusion models dominate image editing, yet their global denoising mechanism entangles edited regions with surrounding context, causing modifications to propagate into areas that should remain intact. We propose a fundamentally different approach by leveraging Masked Generative Transformers (MGTs), whose localized token-prediction paradigm naturally confines changes to intended regions. We present EditMGT, an MGT-based editing framework that is the first of its kind. Our approach employs multi-layer attention consolidation to aggregate cross-attention maps into precise edit localization signals, and region-hold sampling to explicitly prevent token flipping in non-target areas. To support training, we construct CrispEdit-2M, a 2M-sample high-resolution (>1024) editing dataset spanning seven categories. With only 960M parameters, EditMGT achieves state-of-the-art image similarity on multiple benchmarks while delivering 6x faster editing, demonstrating that MGTs offer a compelling alternative to diffusion-based editing.
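
The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify how attention maps are combined, so plain averaging, min-max normalization, and a fixed threshold are assumptions, as are the hypothetical function names `consolidate_attention` and `region_hold`. The sketch only shows the general idea of fusing per-layer maps into an edit mask and then refusing to change tokens outside that mask.

```python
from typing import List, Sequence

Grid = List[List[float]]


def consolidate_attention(layer_maps: Sequence[Grid], threshold: float = 0.5) -> List[List[bool]]:
    """Fuse per-layer attention maps into a boolean edit mask.

    Assumption: layers are combined by a plain mean, then min-max
    normalized and thresholded. The paper may use a different rule.
    """
    rows, cols = len(layer_maps[0]), len(layer_maps[0][0])
    # Average the per-layer scores cell by cell.
    fused = [
        [sum(m[r][c] for m in layer_maps) / len(layer_maps) for c in range(cols)]
        for r in range(rows)
    ]
    lo = min(min(row) for row in fused)
    hi = max(max(row) for row in fused)
    if hi == lo:
        # A flat map carries no localization signal: edit nothing.
        return [[False] * cols for _ in range(rows)]
    return [[(v - lo) / (hi - lo) >= threshold for v in row] for row in fused]


def region_hold(source: List[List[int]], proposed: List[List[int]],
                edit_mask: List[List[bool]]) -> List[List[int]]:
    """Accept proposed tokens inside the mask; keep source tokens elsewhere."""
    return [
        [p if m else s for s, p, m in zip(src_row, prop_row, mask_row)]
        for src_row, prop_row, mask_row in zip(source, proposed, edit_mask)
    ]


# Toy 2x2 token grid with two attention layers (made-up values).
maps = [
    [[0.9, 0.1], [0.0, 0.2]],
    [[0.7, 0.1], [0.2, 0.0]],
]
mask = consolidate_attention(maps)
edited = region_hold([[1, 2], [3, 4]], [[9, 8], [7, 6]], mask)
print(mask)    # [[True, False], [False, False]]
print(edited)  # [[9, 2], [3, 4]]
```

In this toy run only the top-left token receives high attention, so it is the only token that changes. The other three keep their source values even though the model proposed new ones, which is the behavior the abstract describes as preventing token flipping in non-target areas.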

Problem

Research questions and friction points this paper is trying to address.

image editing

diffusion models

edit localization

region entanglement

context preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Generative Transformer

localized token prediction

attention consolidation