🤖 AI Summary
Existing stable diffusion (SD)-based methods for real-world image super-resolution (Real-ISR) recover fine structures poorly (e.g., small text and intricate textures) because the 8× downsampling of their variational autoencoder (VAE) severely attenuates high-frequency details.
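To make the resolution argument concrete, here is a rough back-of-envelope calculation; the 12-pixel glyph height is an illustrative assumption, and only the 8× and 4× factors come from the paper:

```python
# Back-of-envelope: how many latent pixels a small glyph occupies.
glyph_px = 12  # a ~12-pixel-tall character; illustrative assumption
for factor in (8, 4):
    print(f"{factor}x VAE: {glyph_px}px glyph -> {glyph_px / factor:.1f} latent px")
# 8x VAE: 12px glyph -> 1.5 latent px  (strokes collapse below one latent pixel)
# 4x VAE: 12px glyph -> 3.0 latent px
```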
Method: We propose a Transfer VAE Training (TVT) strategy that rebuilds the VAE in two stages, transferring the pre-trained SD model’s 8× VAE into a 4× one while keeping its latent space compatible with the pre-trained UNet: a 4× decoder is first trained on the output features of the original VAE encoder, then a 4× encoder is trained with the new decoder fixed. Additionally, we design a compact VAE and a compute-efficient UNet to reduce the computational overhead.
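A minimal PyTorch sketch of the two-stage transfer, under stated assumptions: the toy `Encoder`/`Decoder` modules, the L1 losses, and upsampling the frozen 8× latent to supervise the 4× decoder are all illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the VAE parts; the real SD VAE is far deeper.
def down_block(cin, cout):  # halves spatial resolution
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.SiLU())

def up_block(cin, cout):  # doubles spatial resolution
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1), nn.SiLU())

class Encoder(nn.Module):
    """Downsamples by 2**n_down; final conv maps to 4 latent channels."""
    def __init__(self, n_down):
        super().__init__()
        blocks, c = [], 3
        for _ in range(n_down):
            blocks.append(down_block(c, 64)); c = 64
        self.body = nn.Sequential(*blocks, nn.Conv2d(c, 4, 3, padding=1))
    def forward(self, x):
        return self.body(x)

class Decoder(nn.Module):
    """Upsamples latents by 2**n_up back to RGB."""
    def __init__(self, n_up):
        super().__init__()
        blocks, c = [], 4
        for _ in range(n_up):
            blocks.append(up_block(c, 64)); c = 64
        self.body = nn.Sequential(*blocks, nn.Conv2d(c, 3, 3, padding=1))
    def forward(self, z):
        return self.body(z)

enc8 = Encoder(n_down=3).eval()  # stands in for the pre-trained 8x encoder
for p in enc8.parameters():
    p.requires_grad_(False)      # original encoder stays frozen

dec4, enc4 = Decoder(n_up=2), Encoder(n_down=2)  # the new 4x pair
x = torch.rand(2, 3, 256, 256)                   # stand-in training batch

# Stage 1: train the 4x decoder against the images, driven by features of
# the frozen original encoder (here: its 8x latent upsampled to the 4x
# grid -- the exact feature tap is an assumption on our part).
z4_from_8 = F.interpolate(enc8(x), scale_factor=2, mode="nearest")
loss_dec = F.l1_loss(dec4(z4_from_8), x)  # optimizer step on dec4 omitted

# Stage 2: freeze the new decoder and train the 4x encoder so the round
# trip enc4 -> dec4 reconstructs the image in the same latent space.
for p in dec4.parameters():
    p.requires_grad_(False)
loss_enc = F.l1_loss(dec4(enc4(x)), x)    # optimizer step on enc4 omitted
```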
Contribution/Results: Our method achieves state-of-the-art performance on Real-ISR with fewer FLOPs than mainstream one-step diffusion models, improving fine-grained structural fidelity and inference efficiency at the same time.
📝 Abstract
Impressive results on real-world image super-resolution (Real-ISR) have been achieved by employing pre-trained stable diffusion (SD) models. However, one critical issue of such methods lies in their poor reconstruction of image fine structures, such as small characters and textures, due to the aggressive resolution reduction of the VAE (e.g., 8× downsampling) in the SD model. One solution is to employ a VAE with a lower downsampling rate for diffusion; however, adapting its latent features to the pre-trained UNet while mitigating the increased computational cost poses new challenges. To address these issues, we propose a Transfer VAE Training (TVT) strategy to transfer the 8× downsampled VAE into a 4× one while adapting to the pre-trained UNet. Specifically, we first train a 4× decoder based on the output features of the original VAE encoder, then train a 4× encoder while keeping the newly trained decoder fixed. Such a TVT strategy aligns the new encoder-decoder pair with the original VAE latent space while enhancing image fine details. Additionally, we introduce a compact VAE and a compute-efficient UNet by optimizing their network architectures, reducing the computational cost while capturing high-resolution fine-scale features. Experimental results demonstrate that our TVT method significantly improves fine-structure preservation, which is often compromised by other SD-based methods, while requiring fewer FLOPs than state-of-the-art one-step diffusion models. The official code can be found at https://github.com/Joyies/TVT.
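A side note on the UNet-compatibility point: one plausible reason a 4× latent can be fed to the pre-trained UNet at all is that diffusion UNets are convolutional over the spatial grid. The toy check below is not the paper's model (the 4 latent channels are assumed, matching the standard SD VAE); it only confirms shape compatibility, whereas TVT's contribution is aligning the latent content.

```python
import torch
import torch.nn as nn

# Toy check: a fully convolutional network accepts latent grids of any
# spatial size, so both the 8x and 4x latents of a 512x512 image pass.
unet_like = nn.Sequential(
    nn.Conv2d(4, 32, 3, padding=1), nn.SiLU(),
    nn.Conv2d(32, 4, 3, padding=1),
)
for size in (64, 128):                # 512x512 image at 8x vs 4x downsampling
    z = torch.rand(1, 4, size, size)  # 4 latent channels assumed (SD VAE)
    print(unet_like(z).shape)         # same spatial size comes back
```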