🤖 AI Summary
To address structural distortion, edge blurring, and semantic inconsistency in face inpainting under large-scale irregular occlusions, this paper proposes a semantic-guided two-stage generative framework. In the first stage, a hybrid CNN-ViT encoder generates a geometrically and semantically coherent facial layout without requiring prior knowledge of occlusion shape. In the second stage, a multimodal texture generator synthesizes high-fidelity textures via dynamic attention mechanisms. The method supports arbitrary mask inputs and introduces a hybrid perceptual loss integrating LPIPS, PSNR, and SSIM for robust optimization and evaluation. Extensive experiments on CelebA-HQ and FFHQ demonstrate significant improvements over state-of-the-art methods, particularly in semantic coherence, identity preservation, and visual realism for large-occlusion scenarios.
📝 Abstract
Facial image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and photorealistic quality, a task central to photo restoration. Despite recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region, semantic inconsistencies, or unconvincing facial structures, due to direct pixel-level synthesis and limited exploitation of facial priors. In this paper we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first synthesizes a semantically organized facial layout and then refines its texture, so that the facial structure is established before detailed appearance is generated. In the first stage, a hybrid encoder combines CNNs, which capture local features, with Vision Transformers, which capture global context, to produce clear and detailed semantic layouts. In the second stage, a Multi-Modal Texture Generator refines these layouts by aggregating information across multiple scales, ensuring cohesive and consistent results. The architecture naturally handles arbitrary mask configurations through dynamic attention, without mask-specific training. Experiments on the CelebA-HQ and FFHQ datasets show that our model outperforms other state-of-the-art methods, with improvements in metrics such as LPIPS, PSNR, and SSIM. It produces visually striking results with better semantic preservation, particularly in challenging large-area inpainting scenarios.