🤖 AI Summary
To give users fine-grained control over both the semantic content and the boundary-blending strength in image inpainting, this paper proposes ControlFill, a framework for interactive, pixel-level manipulation. It trains two prompt embeddings that decouple "object generation" from "background extension" semantics, and pairs them with pixel-wise, spatially varying guidance scales, enabling lightweight control without a heavy text encoder. By jointly adjusting the relative prompt weights and the local guidance intensity, ControlFill combines prompt learning with classifier-free guidance in diffusion-based inpainting, letting users steer the semantics, spatial placement, and boundary blending of the repaired region while preserving high-fidelity synthesis. Experiments show that ControlFill balances controllability and fidelity, outperforming existing methods on fine-grained, user-directed editing.
📝 Abstract
In this report, I present an inpainting framework named *ControlFill*, which involves training two distinct prompts: one for generating plausible objects within a designated mask (*creation*) and another for filling the region by extending the background (*removal*). During the inference stage, these learned embeddings guide a diffusion network that operates without requiring heavy text encoders. By adjusting the relative significance of the two prompts and employing classifier-free guidance, users can control the intensity of removal or creation. Furthermore, I introduce a method to spatially vary the intensity of guidance by assigning different scales to individual pixels.
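The abstract's guidance scheme can be sketched as follows. This is a minimal NumPy illustration of the general idea, not the paper's implementation: the function name, the linear blending of the two prompt predictions, and the exact form of the per-pixel scale map are my assumptions; only the dual-prompt weighting and the spatially varying classifier-free guidance scale come from the text.

```python
import numpy as np

def guided_noise(eps_uncond, eps_create, eps_remove, w, scale_map):
    """Blend the two learned prompts' noise predictions, then apply
    pixel-wise classifier-free guidance (hypothetical sketch).

    eps_uncond, eps_create, eps_remove : (C, H, W) noise predictions
        from the diffusion network under the null, "creation", and
        "removal" prompt embeddings, respectively.
    w         : scalar in [0, 1]; 1 -> pure creation, 0 -> pure removal.
    scale_map : (H, W) per-pixel guidance scale, so guidance strength
        can differ between mask interior and boundary.
    """
    # Interpolate between the two learned prompt predictions.
    eps_cond = w * eps_create + (1.0 - w) * eps_remove
    # Standard CFG update, but with a spatially varying scale.
    return eps_uncond + scale_map[None, :, :] * (eps_cond - eps_uncond)
```

With `scale_map` set to zero everywhere, the output reduces to the unconditional prediction; setting it high only near the mask boundary would, under this sketch, strengthen guidance there while leaving the interior closer to unconditional sampling.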