A$^2$-Edit: Precise Reference-Guided Image Editing of Arbitrary Objects and Ambiguous Masks

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three limitations of existing image editing methods (homogeneous outputs, insufficient category coverage, and reliance on high-precision masks) by proposing a unified inpainting framework that replaces target regions with arbitrary reference objects using only coarse masks. The core innovations are a Mixture of Transformer module, which selects experts dynamically to model diverse object categories, and a Mask Annealing Training Strategy (MATS) that improves robustness to imprecise masks. To further improve generalization, the authors introduce UniEdit-500K, a large-scale, multi-category dataset. Extensive experiments show that the method outperforms state-of-the-art approaches on benchmarks such as VITON-HD and AnyInsertion, achieving both high fidelity and strong generalization in arbitrary object editing.
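The summary page does not give the routing details of the Mixture of Transformer module; as a rough intuition, "dynamic expert selection" usually means a learned gate assigns each token to one of several expert sub-networks. The NumPy sketch below shows a toy top-1 mixture with linear experts standing in for Transformer blocks; the class name, dimensions, expert count, and top-1 routing are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

class MixtureOfTransformerBlocks:
    """Toy top-1 mixture: a learned gate routes each token to one expert.

    Linear experts stand in for full Transformer blocks; dim, n_experts,
    and top-1 routing are illustrative assumptions.
    """

    def __init__(self, dim=16, n_experts=4, seed=0):
        rng = np.random.default_rng(seed)
        self.gate = rng.standard_normal((dim, n_experts)) * 0.1
        self.experts = [rng.standard_normal((dim, dim)) * 0.1
                        for _ in range(n_experts)]

    def __call__(self, tokens):
        logits = tokens @ self.gate          # (n_tokens, n_experts) gate scores
        choice = logits.argmax(axis=-1)      # top-1 expert index per token
        out = np.empty_like(tokens)
        for e, w in enumerate(self.experts):
            sel = choice == e                # tokens routed to expert e
            if sel.any():
                out[sel] = tokens[sel] @ w   # expert-specific transform
        return out, choice
```

The top-1 loop makes the category specialization explicit: tokens from different object categories can land on different experts, while the shared gate is what lets related categories transfer to the same expert.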

📝 Abstract
We propose \textbf{A$^2$-Edit}, a unified inpainting framework for arbitrary object categories, which allows users to replace any target region with a reference object using only a coarse mask. To address the issues of severe homogenization and limited category coverage in existing datasets, we construct a large-scale, multi-category dataset \textbf{UniEdit-500K}, which includes 8 major categories, 209 fine-grained subcategories, and a total of 500,104 image pairs. Such rich category diversity poses new challenges for the model, requiring it to automatically learn semantic relationships and distinctions across categories. To this end, we introduce the \textbf{Mixture of Transformer} module, which performs differentiated modeling of various object categories through dynamic expert selection, and further enhances cross-category semantic transfer and generalization through collaboration among experts. In addition, we propose a \textbf{Mask Annealing Training Strategy} (MATS) that progressively relaxes mask precision during training, reducing the model's reliance on accurate masks and improving robustness across diverse editing tasks. Extensive experiments on benchmarks such as VITON-HD and AnyInsertion demonstrate that A$^2$-Edit consistently outperforms existing approaches across all metrics, providing a new and efficient solution for arbitrary object editing.
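The abstract says MATS "progressively relaxes mask precision during training" but does not give the schedule here. One plausible reading, sketched below in NumPy, is to dilate the ground-truth mask by an amount that grows with training progress, so the model starts from precise masks and gradually sees coarser ones; the linear schedule, the `max_dilation` cap, and the 3x3 cross structuring element are all illustrative assumptions.

```python
import numpy as np

def dilate(mask, iters):
    """Binary dilation with a 3x3 cross (4-connected) structuring element."""
    out = mask.astype(bool)
    for _ in range(iters):
        up = np.roll(out, -1, axis=0);   up[-1, :] = False
        down = np.roll(out, 1, axis=0);  down[0, :] = False
        left = np.roll(out, -1, axis=1); left[:, -1] = False
        right = np.roll(out, 1, axis=1); right[:, 0] = False
        out = out | up | down | left | right
    return out

def anneal_mask(mask, step, total_steps, max_dilation=8):
    """Relax mask precision over training: exact mask early, coarse mask late."""
    progress = step / total_steps                 # 0.0 -> 1.0
    iters = int(round(progress * max_dilation))   # dilation grows linearly
    return dilate(mask, iters)
```

At step 0 the mask is returned unchanged; by the end of training every masked region has been grown by up to `max_dilation` pixels, mimicking the coarse masks users draw at inference time.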
Problem

Research questions and friction points this paper is trying to address.

image editing
arbitrary objects
ambiguous masks
category diversity
reference-guided
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Transformer
Mask Annealing Training Strategy
reference-guided image editing
arbitrary object editing
UniEdit-500K