Locate, Assign, Refine: Taming Customized Promptable Image Inpainting

📅 2024-03-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing image inpainting methods struggle to simultaneously achieve semantic precision, identity consistency, and contextual naturalness in local editing guided by multimodal prompts (text/image). To address this, we propose a multimodal promptable image inpainting framework built upon a novel three-stage paradigm: Locate–Assign–Refine. First, the target region is precisely localized; second, cross-modal attention is decoupled to enable synergistic weighting of text and image prompts; third, a dedicated RefineNet enhances detail fidelity. We further introduce mask-conditioned noise concatenation and a coarse-to-fine diffusion generation mechanism. Additionally, we design a large-model-based self-supervised multimodal data engine to synthesize high-quality training samples. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in identity preservation and text alignment accuracy, while exhibiting strong robustness and generalization across diverse real-world scenarios.
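The summary mentions mask-conditioned noise concatenation as the input construction for precise regional editing. The paper's exact channel layout and latent shapes are not given here, so the following is only a minimal NumPy sketch under assumed shapes: the noise latent, the masked scene latent, and the binary mask are stacked along the channel axis (the function name `locate_inputs` and the 4-channel latents are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def locate_inputs(noise, scene_latent, mask):
    """Sketch of mask-conditioned noise concatenation (Locate mechanism):
    channel-wise concatenation of the noise latent, the masked scene
    latent, and the binary mask. All arrays are (C, H, W) per sample."""
    masked_scene = scene_latent * (1.0 - mask)  # zero out the region to be inpainted
    return np.concatenate([noise, masked_scene, mask], axis=0)

# Toy example: 4 latent channels on an 8x8 grid, editing a 4x4 region.
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 8, 8))
scene = rng.standard_normal((4, 8, 8))
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0  # 1 inside the region to edit, 0 elsewhere
x = locate_inputs(noise, scene, mask)
print(x.shape)  # (9, 8, 8): 4 noise + 4 masked-scene + 1 mask channels
```

The denoiser then sees the unedited context directly in its input, which is what lets the generation stay localized to the masked region.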

📝 Abstract
Prior studies have made significant progress in image inpainting guided by either a text description or a subject image. However, research on inpainting with flexible guidance or control, i.e., text-only, image-only, and their combination, is still in its early stage. Therefore, in this paper, we introduce the multimodal promptable image inpainting project: a new task, model, and data for taming customized image inpainting. We propose LAR-Gen, a novel approach for image inpainting that enables seamless inpainting of specific regions in images corresponding to the mask prompt, incorporating both a text prompt and an image prompt. LAR-Gen adopts a coarse-to-fine manner to ensure context consistency with the source image, subject identity consistency, local semantic consistency with the text description, and smoothness consistency. It consists of three mechanisms: (i) Locate mechanism: concatenating the noise with the masked scene image to achieve precise regional editing; (ii) Assign mechanism: employing a decoupled cross-attention mechanism to accommodate multi-modal guidance; and (iii) Refine mechanism: using a novel RefineNet to supplement subject details. Additionally, to address the issue of scarce training data, we introduce a novel data engine that leverages publicly available pre-trained large models to automatically extract abundant pairs of local text prompts and corresponding visual instances from vast image data. Extensive experiments and various application scenarios demonstrate the superiority of LAR-Gen in terms of both identity preservation and text semantic consistency.
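The Assign mechanism above relies on decoupled cross-attention: the query attends to text tokens and image tokens through separate key/value branches, and the two results are combined with a weight on the image branch. The abstract does not spell out the exact formulation, so this is a hedged NumPy sketch (the function names, token counts, and the `image_scale` parameter are illustrative assumptions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Standard scaled dot-product cross-attention (single head)."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def decoupled_cross_attention(q, text_kv, image_kv, image_scale=1.0):
    """Sketch of the Assign mechanism: attend to text and image prompt
    tokens with separate key/value pairs, then sum the two outputs,
    weighting the image branch by image_scale."""
    text_out = cross_attention(q, *text_kv)
    image_out = cross_attention(q, *image_kv)
    return text_out + image_scale * image_out

# Toy example: 16 spatial queries of dim 64, 77 text tokens, 4 image tokens.
rng = np.random.default_rng(1)
q = rng.standard_normal((16, 64))
text_kv = (rng.standard_normal((77, 64)), rng.standard_normal((77, 64)))
image_kv = (rng.standard_normal((4, 64)), rng.standard_normal((4, 64)))
out = decoupled_cross_attention(q, text_kv, image_kv, image_scale=0.6)
print(out.shape)  # (16, 64)
```

Because the two branches are decoupled, setting `image_scale` to 0 recovers purely text-guided attention, which is what makes text-only, image-only, and combined prompting possible within one model.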
Problem

Research questions and friction points this paper is trying to address.

Image Inpainting
Multimodal Information
Text-and-Image Prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

LAR-Gen
Multi-modal Prompted Image Inpainting
Automatic Data Collection