🤖 AI Summary
Existing few-step diffusion-based image inpainting methods often initialize from random Gaussian noise, which frequently causes semantic inconsistencies between the inpainted region and the background, as well as visible artifacts, making it hard to balance efficiency and quality. To address this, we propose InverFill, a one-shot inversion-based inpainting approach that requires neither retraining nor ground-truth image supervision. InverFill constructs a semantically aligned initial noise by extracting contextual information from the masked input, then integrates this noise into a unified pipeline that combines one-step inversion with fusion sampling. This enables high-quality, text-consistent inpainting at an extremely low number of function evaluations (NFE). Experiments demonstrate that InverFill substantially outperforms the original fusion sampling method with negligible inference overhead, achieving performance on par with specialized inpainting models.
📝 Abstract
Recent diffusion-based models achieve photorealistic image inpainting but require many sampling steps, limiting practical use. Few-step text-to-image models offer faster generation, but naively applying them to inpainting yields poor harmonization between the inpainted region and the background, along with visible artifacts. We trace this failure to random Gaussian noise initialization, which at low numbers of function evaluations (NFEs) causes semantic misalignment and reduced fidelity. To overcome this, we propose InverFill, a one-step inversion method tailored for inpainting that injects semantic information from the input masked image into the initial noise, enabling high-fidelity few-step inpainting. Rather than training a dedicated inpainting model, InverFill feeds the semantically aligned noise into a blended sampling pipeline built on few-step text-to-image models, significantly improving vanilla blended sampling and even matching specialized inpainting models at low NFEs. Moreover, InverFill requires no real-image supervision and adds only minimal inference overhead. Extensive experiments show that InverFill consistently boosts baseline few-step models, improving image quality and text coherence without costly retraining or heavy iterative optimization.
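The pipeline the abstract describes, an inversion-based initial latent that carries background context, followed by blended (fusion) sampling that re-imposes the known region at each step, can be sketched in a heavily simplified toy form. Everything below is illustrative: `toy_denoiser`, the function names, and the noise schedule are stand-ins, not the paper's actual implementation.

```python
import numpy as np

def toy_denoiser(x, t):
    # Hypothetical stand-in for one step of a few-step text-to-image
    # model; a real pipeline would run the diffusion network here,
    # conditioned on the text prompt.
    return x * (1.0 - t)

def invert_masked_image(image, mask, t0, rng):
    # One-shot "inversion": instead of pure Gaussian noise, the initial
    # latent mixes noise with the visible background, so contextual
    # information from the unmasked pixels is present from step one.
    noise = rng.standard_normal(image.shape)
    context = image * (1.0 - mask)  # keep only the known pixels
    return np.sqrt(1.0 - t0) * context + np.sqrt(t0) * noise

def blended_sampling(image, mask, steps=4, t0=0.9, rng=None):
    # Few-step blended (fusion) sampling with inversion-based init:
    # after every denoising step, the known background is re-noised to
    # the matching level and pasted back, so only the masked region
    # (mask == 1) is actually generated.
    rng = rng or np.random.default_rng(0)
    ts = np.linspace(t0, 0.0, steps + 1)
    x = invert_masked_image(image, mask, ts[0], rng)
    for t_next in ts[1:]:
        x = toy_denoiser(x, t_next)
        noise = rng.standard_normal(image.shape)
        known = np.sqrt(1.0 - t_next) * image + np.sqrt(t_next) * noise
        x = mask * x + (1.0 - mask) * known  # blend: keep background
    return x
```

Note the contrast with vanilla blended sampling, which would start from pure Gaussian noise; here the initialization already encodes the background, which is the semantic-alignment idea the paper attributes its few-step quality gains to. At the final step (`t_next == 0`) the blend restores the background exactly.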