DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion Transformers face prohibitively high training and inference costs for ultra-high-resolution image generation due to the quadratic computational complexity of self-attention with respect to token count. To address this, we propose Dynamic Position Extrapolation (DyPE), a method grounded in the spectral evolution of the diffusion process. DyPE adaptively modulates positional encodings during inference, without fine-tuning or additional parameters, to match the frequency characteristics required at each denoising step. Crucially, it enables zero-cost resolution scaling of pre-trained models far beyond their training resolution (e.g., to 16 MP) without introducing sampling overhead. Experiments demonstrate that DyPE achieves state-of-the-art generative quality across multiple high-resolution benchmarks, with performance gains becoming increasingly pronounced as resolution increases.

📝 Abstract
Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching its frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that dramatically exceed the training resolution, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at https://noamissachar.github.io/DyPE/.
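The core idea of per-step positional adjustment can be illustrated with a small sketch. This is not the paper's exact formulation: the function name, the linear schedule in diffusion time `t`, and the NTK-style base rescaling are all illustrative assumptions, chosen because they are a common way to extrapolate rotary position encodings (RoPE). The sketch stretches the encoding's frequency spectrum most at high noise levels (when only low-frequency structure is being resolved) and relaxes toward the trained spectrum as denoising completes.

```python
import numpy as np

def dype_rope_freqs(head_dim, positions, t, train_len=1024, target_len=4096):
    """Hypothetical time-dependent RoPE schedule in the spirit of DyPE.

    t in [0, 1] is the diffusion time (1 = pure noise, 0 = clean image).
    Early steps (large t) get the strongest extrapolation; as t -> 0 the
    schedule reduces to standard RoPE so high-frequency detail can emerge.
    The NTK-style base adjustment below is an assumption, not the paper's
    published formula.
    """
    scale = target_len / train_len                  # e.g. 4x extrapolation
    eff_scale = 1.0 + (scale - 1.0) * t             # interpolate with time
    # NTK-style base rescaling: stretches low-frequency bands the most.
    base = 10000.0 * eff_scale ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    angles = np.outer(positions, inv_freq)          # (num_pos, head_dim // 2)
    return np.cos(angles), np.sin(angles)

# At t = 0 the schedule is plain RoPE at the training base of 10000.
cos0, sin0 = dype_rope_freqs(64, np.arange(8), t=0.0)
```

Because `eff_scale` is 1 at `t = 0`, no fine-tuning is needed: the model sees its familiar positional spectrum by the end of sampling, while intermediate steps borrow extrapolated encodings matched to the coarse structure being generated.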
Problem

Research questions and friction points this paper is trying to address.

How can pre-trained diffusion transformers generate ultra-high-resolution images without retraining?
Self-attention's quadratic scaling makes training at high resolutions prohibitively costly
Fixed positional encodings do not match the frequency content needed at each diffusion step
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Position Extrapolation adjusts positional encodings dynamically at each denoising step
Method leverages the spectral progression of the diffusion process
Enables ultra-high-resolution generation without fine-tuning or added sampling cost