ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image diffusion models often degrade when generating images above their training resolution; existing training-free methods either incur prohibitive computational overhead or are incompatible with modern architectures such as Diffusion Transformers. To address this, the paper proposes ScaleDiff, a model-agnostic, training-free framework for efficient resolution upscaling. It reduces self-attention redundancy via Neighborhood Patch Attention (NPA), integrates Latent Frequency Mixing (LFM) and Structure Guidance, and embeds these components into an SDEdit pipeline, enabling fast inference on both U-Net and Diffusion Transformer backbones. Without any fine-tuning or additional training, ScaleDiff achieves state-of-the-art performance among training-free methods, improving both high-resolution image fidelity and generation efficiency.

📝 Abstract
Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. In this paper, we propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training. A core component of our framework is Neighborhood Patch Attention (NPA), an efficient mechanism that reduces computational redundancy in the self-attention layer with non-overlapping patches. We integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing (LFM) to better generate fine details. Furthermore, we apply Structure Guidance to enhance global structure during the denoising process. Experimental results demonstrate that ScaleDiff achieves state-of-the-art performance among training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures.
Problem

Research questions and friction points this paper is trying to address.

Extending pretrained diffusion models to higher resolutions without retraining
Reducing computational redundancy in self-attention via a patch-based mechanism
Improving image quality and inference speed for training-free higher-resolution synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-agnostic, training-free framework for higher-resolution image synthesis
Neighborhood Patch Attention efficiently reduces self-attention redundancy
Latent Frequency Mixing and Structure Guidance enhance fine details and global structure
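The core efficiency idea, Neighborhood Patch Attention, restricts self-attention to non-overlapping spatial patches so the cost per token depends on the patch size rather than the full resolution. A minimal NumPy sketch of that general idea (function name, shapes, and patch size are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def neighborhood_patch_attention(q, k, v, patch=4):
    """Self-attention restricted to non-overlapping patch x patch blocks.

    q, k, v: (H, W, d) feature maps. Each token attends only to the
    patch*patch tokens in its own block, so cost scales with patch**2
    rather than H*W (illustrative sketch, not the paper's code).
    """
    H, W, d = q.shape
    assert H % patch == 0 and W % patch == 0

    def to_patches(x):
        # (H, W, d) -> (num_patches, patch*patch, d)
        return (x.reshape(H // patch, patch, W // patch, patch, d)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(-1, patch * patch, d))

    qp, kp, vp = map(to_patches, (q, k, v))
    attn = softmax(qp @ kp.transpose(0, 2, 1) / np.sqrt(d))
    out = attn @ vp  # per-patch attention output

    # Reassemble the (H, W, d) spatial map from the patches.
    return (out.reshape(H // patch, W // patch, patch, patch, d)
               .transpose(0, 2, 1, 3, 4)
               .reshape(H, W, d))
```

Since each block attends only within itself, the attention matrix per patch is (patch*patch) x (patch*patch), independent of the upscaled resolution, which is what makes the scheme tractable for high-resolution latents.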