🤖 AI Summary
Existing image inpainting methods struggle to simultaneously achieve semantic precision, identity consistency, and contextual naturalness when local edits are guided by multimodal prompts (text and/or image). To address this, we propose LAR-Gen, a multimodal promptable image inpainting framework built upon a novel three-stage paradigm: Locate–Assign–Refine. First, the target region is precisely localized by concatenating the noise with the masked scene image; second, a decoupled cross-attention mechanism enables synergistic weighting of text and image prompts; third, a dedicated RefineNet supplements subject details, with generation proceeding in a coarse-to-fine manner. Additionally, we design a data engine that leverages publicly available pre-trained large models to automatically extract high-quality paired training samples from large-scale image data. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in identity preservation and text alignment, while exhibiting strong robustness and generalization across diverse real-world scenarios.
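To make the mask-conditioned noise concatenation of the Locate stage concrete, here is a minimal PyTorch sketch of how the denoiser input could be assembled. The tensor shapes, the latent-space encoding, and the `unet_input` name are illustrative assumptions, not the paper's exact implementation.

```python
import torch

# Assumed shapes: noisy latent z_t, VAE-encoded masked scene image, and a
# downsampled binary mask, all in the diffusion latent space (B, C, H, W).
batch, channels, height, width = 2, 4, 64, 64
z_t = torch.randn(batch, channels, height, width)                   # noisy latent at step t
masked_scene_latent = torch.randn(batch, channels, height, width)   # latent of the masked source image
mask = torch.zeros(batch, 1, height, width)                         # 1 inside the region to inpaint
mask[:, :, 16:48, 16:48] = 1.0

# Locate mechanism (as described above): concatenate the noise with the masked
# scene latent and the mask along the channel dimension, so the denoiser sees
# both where to edit and which context to preserve.
unet_input = torch.cat([z_t, masked_scene_latent, mask], dim=1)     # (B, 2C+1, H, W)
print(unet_input.shape)  # torch.Size([2, 9, 64, 64])
```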
📝 Abstract
Prior studies have made significant progress in image inpainting guided by either a text description or a subject image. However, research on inpainting with flexible guidance or control, i.e., text-only, image-only, and their combination, is still at an early stage. Therefore, in this paper, we introduce the multimodal promptable image inpainting project: a new task, model, and data for taming customized image inpainting. We propose LAR-Gen, a novel approach to image inpainting that seamlessly fills the specific region of an image indicated by the mask prompt while incorporating both the text prompt and the image prompt. LAR-Gen adopts a coarse-to-fine manner to ensure consistency with the context of the source image, subject identity consistency, local semantic consistency with the text description, and smoothness consistency. It consists of three mechanisms: (i) the Locate mechanism, which concatenates the noise with the masked scene image to achieve precise regional editing; (ii) the Assign mechanism, which employs decoupled cross-attention to accommodate multi-modal guidance (see the sketch below); and (iii) the Refine mechanism, which uses a novel RefineNet to supplement subject details. Additionally, to address the scarcity of training data, we introduce a novel data engine that automatically extracts substantial pairs of local text prompts and corresponding visual instances from vast image data by leveraging publicly available pre-trained large models. Extensive experiments and various application scenarios demonstrate the superiority of LAR-Gen in terms of both identity preservation and text semantic consistency.
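As a rough illustration of the Assign mechanism, the sketch below runs a text-attention branch and an image-attention branch in parallel over the same UNet features and blends their outputs. The module name `DecoupledCrossAttention`, the feature dimensions, and the additive blending with an `image_scale` weight are assumptions for demonstration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Sketch of decoupled cross-attention: one branch attends to text tokens,
    a parallel branch attends to image-prompt tokens, and the outputs are
    blended with a relative weight on the image branch."""

    def __init__(self, dim=320, ctx_dim=768, heads=8, image_scale=1.0):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.image_scale = image_scale  # relative strength of the image prompt

    def forward(self, hidden_states, text_tokens, image_tokens):
        # hidden_states: (B, N, dim) UNet features; *_tokens: (B, L, ctx_dim)
        text_out, _ = self.text_attn(hidden_states, text_tokens, text_tokens)
        image_out, _ = self.image_attn(hidden_states, image_tokens, image_tokens)
        return text_out + self.image_scale * image_out

attn = DecoupledCrossAttention()
h = torch.randn(2, 4096, 320)      # flattened spatial features from one UNet block
text = torch.randn(2, 77, 768)     # e.g. CLIP text embeddings
image = torch.randn(2, 16, 768)    # e.g. projected subject-image tokens
print(attn(h, text, image).shape)  # torch.Size([2, 4096, 320])
```

Keeping the two branches separate lets the relative influence of the text and image prompts be tuned at inference time without retraining, which is the practical appeal of a decoupled design.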