🤖 AI Summary
Existing GAN inversion-based image inpainting methods suffer from two key limitations: (1) they fail to enforce pixel-level consistency outside the mask, leading to misalignment between inversion and inpainting objectives; and (2) they rely solely on the original RGB image as input, neglecting complementary structural cues such as semantic edges and texture. To address these issues, we propose MMInvertFill, a multimodal-guided GAN inversion framework. It introduces a novel multimodal encoder jointly modeling semantic edges and texture, a gated mask-aware attention mechanism, and an F&W+ latent-space bridging module to align inversion and inpainting goals. Additionally, we design a Soft-update Mean Latent module to enhance high-fidelity reconstruction under large missing regions. Extensive experiments across six benchmark datasets demonstrate state-of-the-art performance, with significant improvements in structural coherence, color fidelity, and texture realism, while also enabling out-of-domain generalization.
📝 Abstract
Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting, which aims to restore lost or damaged image content from its unmasked regions. Previous GAN inversion-based methods usually utilize well-trained GAN models as effective priors to generate realistic content for missing holes. Despite their excellence, they ignore a hard constraint that the unmasked regions in the input and the output should be identical, resulting in a gap between GAN inversion and image inpainting that degrades performance. Besides, existing GAN inversion approaches often consider only a single modality of the input image, neglecting other auxiliary cues that could aid restoration. To address these problems, we propose a novel GAN inversion approach, dubbed MMInvertFill, for image inpainting. MMInvertFill consists primarily of a multimodal guided encoder with pre-modulation and a GAN generator with an F&W+ latent space. Specifically, the multimodal encoder enhances multi-scale structures with additional semantic edge and texture modalities through a gated mask-aware attention module. Afterwards, pre-modulation is applied to encode these structures into style vectors. To mitigate conspicuous color discrepancy and semantic inconsistency, we introduce the F&W+ latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module that captures more diversified in-domain patterns, enabling high-fidelity textures even for massive corruptions. In extensive experiments on six challenging datasets, MMInvertFill qualitatively and quantitatively outperforms other state-of-the-art methods and effectively supports the completion of out-of-domain images.
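The gated mask-aware attention described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name, the single-head formulation, and the sigmoid gating scheme are assumptions; the core idea shown is that hole positions are excluded as attention keys, so missing regions are filled from valid context, and a learned gate blends attended features with the input.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_mask_aware_attention(feat, mask, wq, wk, wv, wg):
    """Illustrative single-head sketch of mask-aware attention with gating.

    feat: (N, C) flattened spatial features
    mask: (N,)  1 = valid (unmasked) position, 0 = hole
    wq, wk, wv, wg: (C, C) projection weights (hypothetical parameters)
    """
    q, k, v = feat @ wq, feat @ wk, feat @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Suppress attention toward hole positions: only valid pixels serve as keys.
    scores = np.where(mask[None, :] > 0, scores, -1e9)
    attn = softmax(scores, axis=-1)
    out = attn @ v
    # Sigmoid gate decides, per position and channel, how much attended
    # context to mix into the original feature.
    gate = 1.0 / (1.0 + np.exp(-(feat @ wg)))
    return gate * out + (1.0 - gate) * feat
```

In practice such a module would be multi-head, operate on 2D feature maps at several scales, and have the mask downsampled to each resolution; the sketch keeps only the masking and gating logic.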
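The Soft-update Mean Latent module can be viewed as maintaining an exponential moving average of the batch-mean latent code during training, so the mean latent tracks diverse in-domain patterns rather than a single fixed statistic. The class below is a sketch under that interpretation; the momentum value and interface are illustrative assumptions, not the paper's specification.

```python
import numpy as np

class SoftUpdateMeanLatent:
    """Sketch of a softly updated mean latent vector.

    Keeps an exponential moving average of batch-mean latent codes,
    which the generator can use as a robust starting point for
    reconstructing heavily corrupted images. The momentum value is
    a hypothetical hyperparameter.
    """

    def __init__(self, dim, momentum=0.999):
        self.mean = np.zeros(dim)
        self.momentum = momentum

    def update(self, latents):
        """latents: (B, dim) batch of encoded latent codes."""
        batch_mean = latents.mean(axis=0)
        # Soft update: blend the running mean with the current batch mean.
        self.mean = self.momentum * self.mean + (1.0 - self.momentum) * batch_mean
        return self.mean
```

A smaller momentum makes the mean latent adapt faster to recent batches at the cost of stability; a value near 1 yields a slowly drifting, averaged code.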