🤖 AI Summary
Blind restoration of low-quality (LQ) face images remains challenging for diffusion models, as their VAE encoders—trained exclusively on high-quality data—fail to accurately model semantic representations of LQ inputs. To address this, we propose the Lightweight Latent Codebook Alignment Adapter (LAFR), enabling latent-space alignment without retraining the VAE. We further introduce a novel multi-level identity-structure joint constraint loss, integrating identity embedding supervision with facial geometric priors, and design an efficient diffusion prior fine-tuning strategy. Using only 0.9% of the FFHQ dataset, our method achieves state-of-the-art performance on both synthetic and real-world degraded benchmarks: PSNR and SSIM improve significantly; identity similarity increases by 12.6%; training time is reduced by 70%; and high-fidelity, efficient reconstruction is enabled even for severely degraded faces.
📝 Abstract
Blind face restoration from low-quality (LQ) images is a challenging task that requires not only high-fidelity image reconstruction but also the preservation of facial identity. While diffusion models like Stable Diffusion have shown promise in generating high-quality (HQ) images, their VAE modules are typically trained only on HQ data, resulting in semantic misalignment when encoding LQ inputs. This mismatch significantly weakens the effectiveness of LQ conditions during the denoising process. Existing approaches often tackle this issue by retraining the VAE encoder, which is computationally expensive and memory-intensive. To address this limitation efficiently, we propose LAFR (Latent Alignment for Face Restoration), a novel codebook-based latent space adapter that aligns the latent distribution of LQ images with that of HQ counterparts, enabling semantically consistent diffusion sampling without altering the original VAE. To further enhance identity preservation, we introduce a multi-level restoration loss that combines constraints from identity embeddings and facial structural priors. Additionally, by leveraging the inherent structural regularity of facial images, we show that lightweight finetuning of diffusion prior on just 0.9% of FFHQ dataset is sufficient to achieve results comparable to state-of-the-art methods, reduce training time by 70%. Extensive experiments on both synthetic and real-world face restoration benchmarks demonstrate the effectiveness and efficiency of LAFR, achieving high-quality, identity-preserving face reconstruction from severely degraded inputs.