High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing GAN inversion-based image inpainting methods suffer from two key limitations: (1) they fail to enforce pixel-level consistency outside the mask, leading to misalignment between inversion and inpainting objectives; and (2) they rely solely on the original RGB image as input, neglecting complementary structural cues such as semantic edges and texture. To address these issues, we propose MMInvertFill, a multimodal-guided GAN inversion framework. It introduces a novel multimodal encoder jointly modeling semantic edges and texture, a gated mask-aware attention mechanism, and an F&W+ latent-space bridging module to align inversion and inpainting goals. Additionally, we design a Soft-update Mean Latent module to enhance high-fidelity reconstruction under large missing regions. Extensive experiments across six benchmark datasets demonstrate state-of-the-art performance, with significant improvements in structural coherence, color fidelity, and texture realism, while also enabling out-of-domain generalization.

📝 Abstract
Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting, which aims to restore lost or damaged image texture using the unmasked content. Previous GAN inversion-based methods usually utilize well-trained GAN models as effective priors to generate realistic content for missing holes. Despite their excellence, they ignore a hard constraint that the unmasked regions in the input and the output should be the same, resulting in a gap between GAN inversion and image inpainting that degrades performance. Besides, existing GAN inversion approaches often consider only a single modality of the input image, neglecting other auxiliary cues that could yield improvements. Addressing these problems, we propose a novel GAN inversion approach, dubbed MMInvertFill, for image inpainting. MMInvertFill primarily contains a multimodal guided encoder with pre-modulation and a GAN generator with an F&W+ latent space. Specifically, the multimodal encoder aims to enhance multi-scale structures with additional semantic segmentation edge and texture modalities through a gated mask-aware attention module. Afterwards, pre-modulation encodes these structures into style vectors. To mitigate conspicuous color discrepancy and semantic inconsistency, we introduce the F&W+ latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module to capture more diversified in-domain patterns, generating high-fidelity textures even under massive corruptions. In extensive experiments on six challenging datasets, we show that MMInvertFill qualitatively and quantitatively outperforms other state-of-the-art methods and effectively supports the completion of out-of-domain images.
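The gated mask-aware attention described above can be illustrated with a toy sketch: attention logits for key positions inside the hole are gated toward negative infinity, so only valid (unmasked) content contributes to the aggregated features. All names, shapes, and the exact gating form below are illustrative assumptions, not the paper's actual operators.

```python
import numpy as np

def gated_mask_aware_attention(x, mask, wq, wk, wv):
    """Toy single-head attention where keys/values at masked (hole)
    positions are suppressed by a mask-derived gate.

    x    : (n, d) token features
    mask : (n,)   1 = valid pixel, 0 = hole
    wq, wk, wv : (d, d) projection matrices (given here, not learned)
    NOTE: this gating scheme is a hypothetical sketch of the idea only.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])            # (n, n) logits
    # Gate: drive logits of hole positions toward -inf so that
    # unmasked content dominates the aggregation.
    scores = scores + np.where(mask[None, :] > 0, 0.0, -1e9)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v                                   # (n, d) features

rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.normal(size=(n, d))
mask = np.array([1, 1, 0, 0, 1, 1])                   # two hole tokens
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = gated_mask_aware_attention(x, mask, wq, wk, wv)
print(out.shape)  # (6, 4)
```

Each output row is then a convex combination of value vectors from valid positions only, which matches the stated goal of restoring holes from unmasked content.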
Problem

Research questions and friction points this paper is trying to address.

Unmasked regions should stay unchanged during inpainting, but GAN inversion does not enforce this constraint
Single-modality (RGB-only) inputs neglect auxiliary cues such as edges and texture
Outputs often exhibit color discrepancy and semantic inconsistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal guided encoder enhances structures
F&W+ latent space bridges inversion gap
Soft-update Mean Latent captures diverse patterns
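The Soft-update Mean Latent idea can be sketched as an exponential moving average that gradually refreshes the mean latent code with samples seen during training; the EMA rule and the rate `tau` below are illustrative assumptions, not the paper's exact update.

```python
import numpy as np

def soft_update_mean_latent(w_mean, w_batch, tau=0.001):
    """EMA-style soft update of the running mean latent code.

    w_mean  : (d,)   current mean latent
    w_batch : (b, d) latent codes from the current training batch
    tau     : soft-update rate (hypothetical value)
    """
    return (1.0 - tau) * w_mean + tau * w_batch.mean(axis=0)

rng = np.random.default_rng(1)
w_mean = np.zeros(8)                      # running mean latent, dim 8
for _ in range(100):                      # simulate 100 training steps
    batch = rng.normal(loc=2.0, size=(4, 8))
    w_mean = soft_update_mean_latent(w_mean, batch, tau=0.05)
print(w_mean.shape)  # (8,)
```

A soft update like this lets the mean latent track the in-domain distribution without abrupt shifts, which is the stated motivation for generating diverse yet faithful textures under large corruptions.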
👥 Authors

Libo Zhang, Institute of Software, Chinese Academy of Sciences, Beijing, China
Yongsheng Yu, University of Rochester (image generation)
Jiali Yao, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, China
Heng Fan, Assistant Professor, University of North Texas (Computer Vision, Machine Learning, Artificial Intelligence)