🤖 AI Summary
This work addresses the object duplication artifacts that appear when Stable Diffusion generates images at resolutions higher than those used during training. Existing training-free remedies, such as dilated convolution kernels and patched diffusion, either make later fine-tuning harder or fail to solve the problem efficiently. The authors propose interpolating the network's weights: convolution kernels are interpolated and then multiplied by a constant coefficient, which the authors show mathematically to scale the kernels correctly. This extends the model to beyond-training-resolution generation without any retraining and yields competitive empirical results. The approach also applies to fully connected layers, not only convolutional ones. When networks are interpolated to adapt to higher-dimensional training data, accuracy and F1 score drop by at most 2.6% relative to the baseline, and the authors discuss how the method can reduce training memory consumption by at least a factor of four.
📝 Abstract
Stable Diffusion (SD) significantly advanced DDPM-based (Denoising Diffusion Probabilistic Model) image generation by denoising in latent space instead of pixel space. This lowered the cost and compute barrier considerably and popularized DDPM-based image generation. However, these models can only generate images at the fixed resolution set by their training configuration. When higher resolutions are attempted, the resulting images consistently show object duplication artifacts. To solve this problem without fine-tuning SD models, recent works have dilated the models' convolution kernels with considerable success. However, dilated kernels are harder to fine-tune because they are zero-gapped, that is, zeros are inserted between the original kernel weights. Other methods, such as patched diffusion, cannot solve the object duplication problem efficiently. To overcome the limitations of dilated convolutions, we propose kernel interpolation of SD models for higher-resolution image generation. We show mathematically that interpolation correctly scales convolution kernels when the result is multiplied by a constant coefficient, and we achieve competitive empirical results in generating beyond-training-resolution images with Stable Diffusion using zero training. Furthermore, we demonstrate that our method enables interpolation of deep neural networks to adapt to higher-dimensional training data, with a worst-case performance drop of $2.6\%$ in accuracy and F1-Score relative to the baseline. Because these experiments interpolate fully-connected layers in addition to convolution layers, they show that our method is generally applicable. We also discuss how our method can reduce the memory footprint of training neural networks by at least $4\times$.
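The idea of interpolating a convolution kernel and rescaling it by a constant can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `interpolate_kernel`, the separable bilinear interpolation, and the coefficient $1/s^2$ for an integer scale factor $s$ are all assumptions made for illustration; the coefficient derived in the paper may differ. The intuition behind this particular choice is that a kernel enlarged by $s$ in each dimension covers $s^2$ times as many entries, so dividing by $s^2$ keeps the response to a constant input unchanged.

```python
import numpy as np

def interpolate_kernel(kernel, scale):
    """Enlarge a 2D kernel by an integer `scale` and rescale it by a constant.

    Hypothetical helper: bilinear interpolation and the 1 / scale**2
    coefficient are illustrative assumptions, not the paper's exact recipe.
    """
    kh, kw = kernel.shape
    new_h, new_w = kh * scale, kw * scale

    # Sample positions expressed in the original kernel's index space.
    rows = np.linspace(0, kh - 1, new_h)
    cols = np.linspace(0, kw - 1, new_w)

    # Separable linear interpolation: first along columns, then along rows.
    tmp = np.stack([np.interp(cols, np.arange(kw), kernel[i]) for i in range(kh)])
    resized = np.stack(
        [np.interp(rows, np.arange(kh), tmp[:, j]) for j in range(new_w)],
        axis=1,
    )

    # Constant coefficient (assumed): compensate for the s^2 growth in area.
    return resized * (1.0 / scale**2)

# Usage: a 3x3 averaging kernel becomes a 6x6 kernel with the same total weight.
box = np.full((3, 3), 1.0 / 9.0)
bigger = interpolate_kernel(box, 2)
```

Unlike a dilated kernel, the enlarged kernel here has no inserted zeros between the original weights: every entry carries an interpolated value.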