UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Image diffusion Transformers suffer from content repetition and quality degradation when extrapolating to extreme resolutions (e.g., beyond 4K). This paper presents the first systematic framework tailored for such high-fidelity extrapolation. First, a frequency-domain analysis of positional embeddings reveals that repetition stems from the periodicity of the dominant frequency, whose period aligns with the training resolution. Second, a recursive dominant-frequency correction module constrains that frequency within a single period after extrapolation, suppressing periodic artifacts. Third, an entropy-guided adaptive attention concentration mechanism dynamically balances local detail reconstruction with global structural consistency, augmented by attention sparsification for computational efficiency. Evaluated on Qwen-Image and Flux architectures, the method enables generation of 6K×6K images without low-resolution guidance, achieving state-of-the-art improvements: significantly reduced repetition rates, enhanced visual fidelity, and superior structural coherence.

📝 Abstract
Recent image diffusion transformers achieve high-fidelity generation at their training resolutions, but struggle to generate images beyond these scales, suffering from content repetition and quality degradation. In this work, we present UltraImage, a principled framework that addresses both issues. Through frequency-wise analysis of positional embeddings, we identify that repetition arises from the periodicity of the dominant frequency, whose period aligns with the training resolution. We introduce a recursive dominant frequency correction to constrain it within a single period after extrapolation. Furthermore, we find that quality degradation stems from diluted attention and thus propose entropy-guided adaptive attention concentration, which assigns higher focus factors to sharpen local attention for fine detail and lower ones to global attention patterns to preserve structural consistency. Experiments show that UltraImage consistently outperforms prior methods on Qwen-Image and Flux (around 4K) across three generation scenarios, reducing repetition and improving visual fidelity. Moreover, UltraImage can generate images up to 6K×6K without low-resolution guidance from a training resolution of 1328p, demonstrating its extreme extrapolation capability. Project page: https://thu-ml.github.io/ultraimage.github.io/
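The entropy-guided attention concentration described above can be pictured roughly as follows: heads with diluted (high-entropy) attention maps get a larger focus factor that sharpens the softmax, while heads with already-peaked global maps stay near the baseline. This is a minimal NumPy sketch; the function name, shapes, and the linear entropy-to-focus mapping are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def concentrated_attention(q, k, v, tau_min=1.0, tau_max=2.0):
    """Entropy-guided attention concentration (illustrative sketch).

    q, k, v: arrays of shape (heads, seq, dim). Heads whose attention
    maps are diluted (high entropy) receive a focus factor near tau_max;
    heads with already-concentrated maps stay near tau_min.
    """
    d = q.shape[-1]
    logits = q @ k.swapaxes(-2, -1) / np.sqrt(d)       # (H, N, N)
    probs = softmax(logits)

    # Mean entropy per head, normalized by log(N) into [0, 1].
    ent = -(probs * np.log(np.clip(probs, 1e-9, None))).sum(-1).mean(-1)
    ent_norm = ent / np.log(probs.shape[-1])           # (H,)

    # Higher entropy -> larger focus factor -> sharper re-softmax.
    focus = tau_min + (tau_max - tau_min) * ent_norm
    sharpened = softmax(logits * focus[:, None, None])
    return sharpened @ v
```

Scaling the logits by a per-head factor before the softmax is the standard way to sharpen or flatten an attention distribution; here the factor is simply tied to each head's measured entropy.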
Problem

Research questions and friction points this paper is trying to address.

Addresses content repetition in image diffusion transformers during resolution extrapolation
Solves quality degradation from diluted attention in high-resolution image generation
Enables extreme resolution extrapolation beyond training scales without low-resolution guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recursive dominant frequency correction for resolution extrapolation
Entropy-guided adaptive attention concentration for detail enhancement
Generating high-resolution images up to 6K without low-resolution guidance
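The recursive dominant frequency correction is not specified here beyond the abstract, but its core idea, keeping the dominant positional-embedding frequency within a single period at the extrapolated resolution, can be sketched as a simple rescaling. The function name and the dominant-frequency selection rule below are assumptions, and the paper's recursive scheme may differ.

```python
import numpy as np

def correct_dominant_frequency(freqs, train_len, target_len):
    """Rescale positional-embedding frequencies that would wrap past one
    period at the extrapolated length (illustrative sketch).

    freqs: 1-D array of per-dimension angular frequencies theta_i, as in
    RoPE-style embeddings where dimension i rotates by theta_i per token.
    Any frequency whose period (2*pi / theta_i) reaches the training
    length is treated as 'dominant' and shrunk so that one full period
    now spans the target length, avoiding periodic repetition.
    """
    freqs = np.asarray(freqs, dtype=float).copy()
    periods = 2 * np.pi / freqs
    dominant = periods >= train_len            # would repeat when extrapolating
    freqs[dominant] *= train_len / target_len  # stretch period by target/train
    return freqs
```

For example, with a training length of 1000 and a target of 4000, a frequency whose period is 2000 tokens would repeat twice across the extrapolated image; the rescaling stretches its period to 8000 so it stays within a single cycle.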
Authors

Min Zhao (Dept. of Comp. Sci. & Tech., BNRist Center, Tsinghua University)
Bokai Yan (Gaoling School of Artificial Intelligence, Renmin University of China)
Xue Yang (Dept. of Comp. Sci. & Tech., BNRist Center, Tsinghua University)
Hongzhou Zhu (Tsinghua University)
Jintao Zhang (Tsinghua University)
Shilong Liu (RS@ByteDance, PhD@THU)
Chongxuan Li (Associate Professor, Renmin University of China)
Jun Zhu (Dept. of Comp. Sci. & Tech., BNRist Center, Tsinghua University)